Durley Yalile Ramírez Mendoza
UNIVERSIDAD EAFIT
Resumen
En el presente documento se busca agrupar diez mil un (10.001) pymes colombianas teniendo en cuenta los años 2016, 2017, 2018 y 2019 usando herramientas de machine learning con algoritmos de clúster no supervisados como son los modelos K – Means y clúster Aglomerativo. Estos modelos generarán conglomerados con características financieras similares donde al analizar sus cifras e indicadores se espera que cada clúster tenga una distinción financiera marcada y de esta manera se podrán plantear estrategias de acuerdo al mejor modelo de agrupación según las condiciones económicas de las pymes a las que pertenece cada clúster.
A continuación se describe el código en Python usado para el desarrollo del proyecto:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import sys
if not sys.warnoptions:
import warnings
warnings.simplefilter("ignore")
%matplotlib inline
pymes =pd.read_excel('./datos/Datos_Ratios.xlsx')
pymes.head()
| Macrosector | Desc_Macrosector | CATEGORIA | Descripción | SUBCATEGORIA | NOMBRE | NIT | PERIODO | Ingresos_operacionales | Costo_ventas | ... | Utilidad_Neta | Margen_Neto | Activo | Pasivo | Patrimonio | Ebitda_ | Margen_Ebitda | Indice_endeudamiento | ROA | ROE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2016 | 310130 | 4077.0 | ... | 15955 | 0.051446 | 3419876 | 2499818 | 920058 | 38816 | 0.125160 | 0.730967 | 0.004665 | 0.017341 |
| 1 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2017 | 300000 | 0.0 | ... | 36034 | 0.120113 | 3766237 | 2810145 | 956092 | -70789 | -0.235963 | 0.746141 | 0.009568 | 0.037689 |
| 2 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2018 | 300000 | NaN | ... | 16287 | 0.054290 | 5050305 | 4077926 | 972379 | -40105 | -0.133683 | 0.807461 | 0.003225 | 0.016750 |
| 3 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2019 | 300000 | NaN | ... | 69527 | 0.231757 | 5013624 | 3971718 | 1041906 | 208181 | 0.693937 | 0.792185 | 0.013868 | 0.066731 |
| 4 | 1 | Sector agropecuario | A | AGRICULTURA, GANADERÍA, CAZA, SILVICULTURA Y P... | 145 | Cría de aves de corral | 800000276 | 2016 | 321430952 | 274957420.0 | ... | 268204 | 0.000834 | 141586731 | 94570079 | 47016652 | 9848367 | 0.030639 | 0.667930 | 0.001894 | 0.005704 |
5 rows × 34 columns
La base de datos corresponde a 4004 entradas o filas y 34 variables o columnas
pymes.shape
(40004, 34)
Las 4004 entradas corresponden 1001 empresas en los años 2016, 2017, 2018, 2019.
len(pymes.NIT.unique())
10001
pymes.dtypes
Macrosector int64 Desc_Macrosector object CATEGORIA object Descripción object SUBCATEGORIA int64 NOMBRE object NIT int64 PERIODO int64 Ingresos_operacionales int64 Costo_ventas float64 Utilidad_bruta int64 Margen_Bruto float64 Otros_ingresos int64 Costos_distribución float64 Gastos_admon float64 Otros_gastos float64 Otras_ganancias_operdidas float64 Utilidad_operativa int64 Margen_operativo float64 Ing_financieros float64 Cto_financieros float64 Utilidad_antes_impuestos int64 Margen_antes_impuestos float64 Gasto_impuestos float64 Utilidad_Neta int64 Margen_Neto float64 Activo int64 Pasivo int64 Patrimonio int64 Ebitda_ int64 Margen_Ebitda float64 Indice_endeudamiento float64 ROA float64 ROE float64 dtype: object
# se realiza una copia de la base datos para poder hacer cambios y ajustes
pym_total=pymes.copy()
Se presentan valores nulos en las variables costo de ventas, costos de distribución, gastos administrativos, otros gastos, otras ganancias o perdidas, ingresos financieros, costos financieros y gasto de impuestos.
pymesmissed=pym_total.isnull()
#columnas con valores nulos y su porcentaje
columnas_connull=[]
# porcentajes por columna de valores nulos
porc_valoresnull=[]
for column in pymesmissed.columns.values.tolist():
if ((pymesmissed[column]).value_counts()[0]*100/len(pymesmissed))<100:
columnas_connull.append(column)
porc_valoresnull.append(float("{0:.3f}".format((pymesmissed[column]).value_counts()[1]*100/len(pymesmissed))))
print(column,"% valores no nulos y nulos")
print ((pymesmissed[column]).value_counts()*100/len(pymesmissed))
print("")
print ("lista de columnas con valores nulos: ",columnas_connull)
print ("lista porcentajes con valores nulos: ",porc_valoresnull)
Costo_ventas % valores no nulos y nulos False 83.084192 True 16.915808 Name: Costo_ventas, dtype: float64 Costos_distribución % valores no nulos y nulos False 68.108189 True 31.891811 Name: Costos_distribución, dtype: float64 Gastos_admon % valores no nulos y nulos False 97.947705 True 2.052295 Name: Gastos_admon, dtype: float64 Otros_gastos % valores no nulos y nulos False 86.631337 True 13.368663 Name: Otros_gastos, dtype: float64 Otras_ganancias_operdidas % valores no nulos y nulos True 78.967103 False 21.032897 Name: Otras_ganancias_operdidas, dtype: float64 Ing_financieros % valores no nulos y nulos False 60.211479 True 39.788521 Name: Ing_financieros, dtype: float64 Cto_financieros % valores no nulos y nulos False 74.815018 True 25.184982 Name: Cto_financieros, dtype: float64 Gasto_impuestos % valores no nulos y nulos False 92.333267 True 7.666733 Name: Gasto_impuestos, dtype: float64 lista de columnas con valores nulos: ['Costo_ventas', 'Costos_distribución', 'Gastos_admon', 'Otros_gastos', 'Otras_ganancias_operdidas', 'Ing_financieros', 'Cto_financieros', 'Gasto_impuestos'] lista porcentajes con valores nulos: [16.916, 31.892, 2.052, 13.369, 78.967, 39.789, 25.185, 7.667]
Los valores nulos se convierten en ceros
pym_total=pym_total.fillna(0)
pym_total.isnull().sum()
Macrosector 0 Desc_Macrosector 0 CATEGORIA 0 Descripción 0 SUBCATEGORIA 0 NOMBRE 0 NIT 0 PERIODO 0 Ingresos_operacionales 0 Costo_ventas 0 Utilidad_bruta 0 Margen_Bruto 0 Otros_ingresos 0 Costos_distribución 0 Gastos_admon 0 Otros_gastos 0 Otras_ganancias_operdidas 0 Utilidad_operativa 0 Margen_operativo 0 Ing_financieros 0 Cto_financieros 0 Utilidad_antes_impuestos 0 Margen_antes_impuestos 0 Gasto_impuestos 0 Utilidad_Neta 0 Margen_Neto 0 Activo 0 Pasivo 0 Patrimonio 0 Ebitda_ 0 Margen_Ebitda 0 Indice_endeudamiento 0 ROA 0 ROE 0 dtype: int64
pym_total.dtypes
Macrosector int64 Desc_Macrosector object CATEGORIA object Descripción object SUBCATEGORIA int64 NOMBRE object NIT int64 PERIODO int64 Ingresos_operacionales int64 Costo_ventas float64 Utilidad_bruta int64 Margen_Bruto float64 Otros_ingresos int64 Costos_distribución float64 Gastos_admon float64 Otros_gastos float64 Otras_ganancias_operdidas float64 Utilidad_operativa int64 Margen_operativo float64 Ing_financieros float64 Cto_financieros float64 Utilidad_antes_impuestos int64 Margen_antes_impuestos float64 Gasto_impuestos float64 Utilidad_Neta int64 Margen_Neto float64 Activo int64 Pasivo int64 Patrimonio int64 Ebitda_ int64 Margen_Ebitda float64 Indice_endeudamiento float64 ROA float64 ROE float64 dtype: object
pymesmissed=pym_total.isnull()
#columnas con valores nulos y su porcentaje
columnas_connull=[]
# porcentajes por columna de valores nulos
porc_valoresnull=[]
for column in pymesmissed.columns.values.tolist():
if ((pymesmissed[column]).value_counts()[0]*100/len(pymesmissed))<100:
columnas_connull.append(column)
porc_valoresnull.append(float("{0:.3f}".format((pymesmissed[column]).value_counts()[1]*100/len(pymesmissed))))
print(column,"% valores no nulos y nulos")
print ((pymesmissed[column]).value_counts()*100/len(pymesmissed))
print("")
print ("lista de columnas con valores nulos: ",columnas_connull)
print ("lista porcentajes con valores nulos: ",porc_valoresnull)
lista de columnas con valores nulos: [] lista porcentajes con valores nulos: []
# Histogramas
plt.rcParams["figure.figsize"] = (25,14)
pym_total.hist()
plt.show()
# correlograma para las variables predictoras
sns.pairplot(pym_total, kind="scatter")
plt.show()
# se obervan varias relaciones lineales para analizar
# columnas
pym_total.columns
Index(['Macrosector', 'Desc_Macrosector', 'CATEGORIA', 'Descripción',
'SUBCATEGORIA', 'NOMBRE', 'NIT', 'PERIODO', 'Ingresos_operacionales',
'Costo_ventas', 'Utilidad_bruta', 'Margen_Bruto', 'Otros_ingresos',
'Costos_distribución', 'Gastos_admon', 'Otros_gastos',
'Otras_ganancias_operdidas', 'Utilidad_operativa', 'Margen_operativo',
'Ing_financieros', 'Cto_financieros', 'Utilidad_antes_impuestos',
'Margen_antes_impuestos', 'Gasto_impuestos', 'Utilidad_Neta',
'Margen_Neto', 'Activo', 'Pasivo', 'Patrimonio', 'Ebitda_',
'Margen_Ebitda', 'Indice_endeudamiento', 'ROA', 'ROE'],
dtype='object')
Se usan los ratios: 'Margen_Bruto','Margen_operativo','Margen_antes_impuestos','MargenNeto','Ebitda','Margen_Ebitda', 'Indice_endeudamiento', 'ROA', 'ROE' Con estos ratios se realiza el análisis de cluster
norm=['Margen_Bruto','Margen_operativo','Margen_antes_impuestos','Margen_Neto','Ebitda_',
'Margen_Ebitda', 'Indice_endeudamiento', 'ROA', 'ROE']
# Descripción estadistica de las variables
pym_total[norm].describe()
| Margen_Bruto | Margen_operativo | Margen_antes_impuestos | Margen_Neto | Ebitda_ | Margen_Ebitda | Indice_endeudamiento | ROA | ROE | |
|---|---|---|---|---|---|---|---|---|---|
| count | 40004.000000 | 40004.000000 | 40004.000000 | 4.000400e+04 | 4.000400e+04 | 40004.000000 | 40004.000000 | 40004.000000 | 40004.000000 |
| mean | 0.375370 | -66.330572 | -70.059246 | -5.040509e+02 | 9.732050e+05 | -85.643047 | 0.531142 | 0.017429 | -0.033721 |
| std | 4.159000 | 5699.524289 | 6151.573431 | 8.931566e+04 | 1.015514e+07 | 7074.222069 | 1.254889 | 2.363035 | 13.304603 |
| min | -690.953655 | -704428.000000 | -704428.000000 | -1.783311e+07 | -9.964076e+08 | -759578.000000 | 0.000000 | -452.867472 | -2493.777899 |
| 25% | 0.169148 | 0.019628 | 0.009714 | 3.239875e-03 | 3.028400e+04 | 0.004938 | 0.269141 | 0.002532 | 0.007777 |
| 50% | 0.315824 | 0.070734 | 0.051520 | 3.087238e-02 | 4.482840e+05 | 0.054954 | 0.498212 | 0.023686 | 0.057545 |
| 75% | 0.749898 | 0.164426 | 0.135803 | 8.950329e-02 | 1.291925e+06 | 0.141029 | 0.700747 | 0.063108 | 0.152591 |
| max | 1.000000 | 49963.000000 | 49888.000000 | 1.689641e+04 | 3.546446e+08 | 49963.000000 | 159.477966 | 106.306635 | 163.077837 |
# se realiza una nueva copia de los datos para evitar perdida de los datos transformados y modificados
pyclus=pym_total.copy()
# matriz de correlación
corr_mat =pyclus[norm].corr()
for x in range(len(norm)):
corr_mat.iloc[x,x] = 0.0
corr_mat
| Margen_Bruto | Margen_operativo | Margen_antes_impuestos | Margen_Neto | Ebitda_ | Margen_Ebitda | Indice_endeudamiento | ROA | ROE | |
|---|---|---|---|---|---|---|---|---|---|
| Margen_Bruto | 0.000000 | -0.000269 | -0.000290 | -0.000734 | 0.014023 | -0.000573 | -0.006492 | 0.001604 | 0.001231 |
| Margen_operativo | -0.000269 | 0.000000 | 0.987235 | 0.469891 | 0.002410 | 0.945253 | 0.000608 | 0.000672 | -0.000084 |
| Margen_antes_impuestos | -0.000290 | 0.987235 | 0.000000 | 0.588646 | 0.002319 | 0.924713 | 0.000508 | 0.000738 | -0.000054 |
| Margen_Neto | -0.000734 | 0.469891 | 0.588646 | 0.000000 | 0.000851 | 0.388183 | 0.000856 | 0.000707 | 0.000148 |
| Ebitda_ | 0.014023 | 0.002410 | 0.002319 | 0.000851 | 0.000000 | 0.002422 | -0.009724 | 0.504633 | 0.473613 |
| Margen_Ebitda | -0.000573 | 0.945253 | 0.924713 | 0.388183 | 0.002422 | 0.000000 | 0.001388 | 0.000583 | -0.000068 |
| Indice_endeudamiento | -0.006492 | 0.000608 | 0.000508 | 0.000856 | -0.009724 | 0.001388 | 0.000000 | 0.007892 | -0.001992 |
| ROA | 0.001604 | 0.000672 | 0.000738 | 0.000707 | 0.504633 | 0.000583 | 0.007892 | 0.000000 | 0.907329 |
| ROE | 0.001231 | -0.000084 | -0.000054 | 0.000148 | 0.473613 | -0.000068 | -0.001992 | 0.907329 | 0.000000 |
# Matriz de correlación
f, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(corr_mat,square=True,cbar = True, annot = True, annot_kws = {'size': 8} );
# Máxima correlación entre variables
corr_mat.abs().idxmax()
Margen_Bruto Ebitda_ Margen_operativo Margen_antes_impuestos Margen_antes_impuestos Margen_operativo Margen_Neto Margen_antes_impuestos Ebitda_ ROA Margen_Ebitda Margen_operativo Indice_endeudamiento Ebitda_ ROA ROE ROE ROA dtype: object
# Máxima correlacion
corr_mat.abs().max()
# Se oberva que algunas variables tienen muy alta correlación por encima del 90%
Margen_Bruto 0.014023 Margen_operativo 0.987235 Margen_antes_impuestos 0.987235 Margen_Neto 0.588646 Ebitda_ 0.504633 Margen_Ebitda 0.945253 Indice_endeudamiento 0.009724 ROA 0.907329 ROE 0.907329 dtype: float64
# Grafica donde se observa la fuerte correlacion entre las variables Margen operativo y Margen antes de impuestos
plt.scatter(x="Margen_operativo", y="Margen_antes_impuestos",data=pyclus)
plt.title("Margen_operativo vs Margen antes deimpuestos")
plt.xlabel("Margen_operativo")
plt.ylabel("Margen_antes_impuestos")
plt.show()
Principal Component Analysis (PCA) es un método estadístico que permite simplificar la complejidad de espacios muestrales con muchas dimensiones a la vez que conserva su información
Este método de transformación de variables consiste en hacer cumplir las dos condiciones establecidas previamente, pero teniendo en cuenta únicamente la información de los predictores y no necesariamente la relación de estos con la variable de respuesta. En particular, se debe encontrar los vectores $V_{j}$ y los valores $\lambda_{j}$ que cumplan la siguiente ecuación:
$$S_{x}V_{j}=\lambda_{j}V_{j}$$Donde $S_{x}$ es la matriz de varianza y covarianza de la matriz de predictores ($X$). Finalmente, la matriz de transformación $A$ tiene como columnas a cada uno de los vectores $V_{j}$. Además, $\lambda_{j}$ corresponde a la varianza o información que condensa la variable transformada $Z_{j}$.
El método de PCA permite por lo tanto “condensar” la información aportada por múltiples variables en solo unas pocas componentes.
# se crea una base de datos solo con los ratios:
#'Margen_Bruto','Margen_operativo','Margen_antes_impuestos',
#'Margen_Neto','Ebitda_','Margen_Ebitda', 'Indice_endeudamiento', 'ROA', 'ROE'
pynet=pyclus[norm]
Se estandarizan las variables y se transforman a la misma escala eliminando la media y escalando los datos de forma que su varianza sea igual a 1. Con esto se busca que se le asigne igual importancia a cada variable.
scaler = StandardScaler()
scaler.fit(pynet)
XEscalado=scaler.transform(pynet)
Ahora, calcularemos la matriz 𝐴 usando la siguiente línea de código:
p=len(XEscalado[1,])
pca = PCA(n_components=p).fit(XEscalado)
A=pca.components_
Una vez hecho esto, podemos visualizar la varianza explicada por cada componente. Para ello, haremos el siguiente gráfico:
plt.rcParams["figure.figsize"] = (25,14)
varianzaExplicada=pca.explained_variance_
porcentajeVarianzaExplicada=pca.explained_variance_ratio_
plt.plot(range(1,p+1),porcentajeVarianzaExplicada)
plt.xlabel('Componente')
plt.title('Varianza de cada componente')
plt.show()
Como se puede observar, las componentes están ordenadas desde la que más varianza o información contiene, hasta la que menos. Los valores reportados en el anterior gráfico corresponden a los 𝜆𝑗 establecidos con anterioridad.
Sin embargo, generalmente se suele ver el gráfico del porcentaje de la variabilidad explicado por las primeras 𝑗 componentes. Para ello obtener dicho gráfico, ejecutaremos el siguiente código:
plt.plot(range(1,p+1),np.cumsum(porcentajeVarianzaExplicada),color='green', linestyle='dashed', linewidth = 3,
marker='o', markerfacecolor='blue', markersize=12)
plt.xlabel('Cantidad acumulada de componentes principales',fontsize=20)
plt.title('Porcentaje acumulado de varianza explicada',fontsize=20)
plt.plot(5,0.9126, marker="o", color="red")
plt.text(5.2,0.90, '5 componentes, 91% de varianza explicada', fontsize=20, color='black')
plt.xlim(0,9)
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)
plt.show()
Para presentarlo mejor, a continuación se muestran ejemplos de la interpretación del anterior gráfico:
La primera componente principal explica alrededor del 35% de la variabilidad o varianza.
Las primeras dos componentes principales explican alrededor del 61% de la variabilidad.
Las primeras 3 componentes principales explican alrededor del 72% de la variabilidad.
Las primeras 4 componentes principales explican alrededor del 83% de la variabilidad.
Las primeras 5 componentes principales explican cerca del 91% de la variabilidad.
De esta manera, se puede ver que con las primeras 5 componentes principales ya se está explicando alrededor del 91% de la variabilidad total. Por lo anterior, escogeremos las 5 primeras componentes principales.
explained_variance =np.cumsum(porcentajeVarianzaExplicada)
explained_variance[4]
0.9126002196336194
# se usan 5 componentes
pca = PCA(n_components=5)
pca.fit(XEscalado)
PCA(n_components=5)
# se escalan y transforman las variables
pca.transform(XEscalado)
array([[-0.02080957, 0.04394546, 0.01200697, -0.21652653, 0.00122939],
[-0.02078972, 0.0468192 , 0.01875702, -0.22729312, 0.00112998],
[-0.02082335, 0.04807276, 0.0531698 , -0.26179262, 0.00120966],
...,
[-0.02131384, -0.06492353, 0.05066076, -0.15789233, 0.00190413],
[-0.02135139, -0.07680599, 0.02636615, -0.01360855, 0.00287973],
[-0.02225564, -0.28203849, -0.02003696, 0.03220388, 0.00499648]])
# Se crea el score con los pca trasnformados
scores_pca=pca.transform(XEscalado)
scores_pca[1]
array([-0.02078972, 0.0468192 , 0.01875702, -0.22729312, 0.00112998])
# Ingresos operacionales
plt.rcParams["figure.figsize"] = (25,14)
pyop=pynet[pynet["Ingresos_operacionales"]<35572666.2]
pyop["Ingresos_operacionales"].hist()
<AxesSubplot:>
# Costos de distribución
plt.rcParams["figure.figsize"] = (25,14)
pyop=pynet[(pynet["Costos_distribución"]<3783256.2)]
pyop["Costos_distribución"].hist()
<AxesSubplot:>
# Gastos administración
plt.rcParams["figure.figsize"] = (25,14)
pyop=pynet[(pynet["Gastos_admon"]<3644160.5)]
pyop["Gastos_admon"].hist()
<AxesSubplot:>
# Costo financieros
plt.rcParams["figure.figsize"] = (25,14)
pyop=pynet[pynet["Cto_financieros"]<1072960.2]
pyop["Cto_financieros"].hist()
<AxesSubplot:>
# Gastos de impuestos
plt.rcParams["figure.figsize"] = (25,14)
pyop=pynet[(pynet["Gasto_impuestos"]<888069.60)&(pynet["Gasto_impuestos"]>-300000)]
pyop["Gasto_impuestos"].hist()
<AxesSubplot:>
# Pasivo
plt.rcParams["figure.figsize"] = (25,14)
pyop=pynet[(pynet["Pasivo"]<21989298.5)]
pyop["Pasivo"].hist(bins=5)
<AxesSubplot:>
# Patrimonio
plt.rcParams["figure.figsize"] = (25,14)
pyop=pynet[(pynet["Patrimonio"]<22559281.6)]
pyop["Patrimonio"].hist(bins=5)
<AxesSubplot:>
El algoritmo K-Means busca agrupar en clústeres o grupos los puntos de datos que tienen una alta similitud entre ellos. En los términos del algoritmo, esta similitud se entiende como lo opuesto de la distancia entre puntos de datos. Cuanto más cerca estén los puntos de datos, más similares y con más probabilidades de pertenecer al mismo clúster serán (Roman, 2019).
El algoritmo de agrupamiento K-Means es un proceso iterativo que mueve los centros de los agrupamientos o centroides a la posición media de sus puntos constituyentes, y reasignando instancias a sus clústeres más cercanos iterativamente hasta que no haya un cambio significativo en el número de clústeres posibles o número de iteraciones alcanzadas (Dangeti, 2017).
La función de K-Means está determinada por la distancia euclidiana (norma cuadrada) entre las observaciones pertenecientes a ese grupo con su respectivo valor centroide.
Pasos del Algoritmo según (Roman, 2019)
a) Primero, necesitamos elegir k, el número de clústeres que queremos que nos encuentren.
b) Luego, el algoritmo seleccionará aleatoriamente los centroides de cada grupo.
c) Se asignará cada punto de datos al centroide más cercano (utilizando la distancia euclídea).
d) Se calculará la inercia del conglomerado.
e) Los nuevos centroides se calcularán como la media de los puntos que pertenecen al centroide del paso anterior. En otras palabras, calculando el error cuadrático mínimo de los puntos de datos al centro de cada clúster, moviendo el centro hacia ese punto.
f) Volver al paso c.
Se realiza el análisis teniendo en cuenta 100 cluster para analizar la inercia y encontrar el numero optimo de cluster.
# Ajuste del modelo
from sklearn.cluster import KMeans
km_list = list()
for clust in range(1,100):
km = KMeans(n_clusters=clust, random_state=42)
km = km.fit(scores_pca)
km_list.append(pd.Series({'clusters': clust,
'inertia': km.inertia_,
'model': km}))
plot_data = (pd.concat(km_list, axis=1)
.T
[['clusters','inertia']]
.set_index('clusters'))
ax = plot_data.plot(marker='o',ls='-')
ax.set_xticks(range(0,101,2))
ax.set_xlim(0,101)
plt.plot(50,2000, marker="o", color="red")
ax.set(xlabel='Cluster', ylabel='Inertia');
Se eligen 50 clúster como numero optimo. Estos 50 clúster se toman para crear una mejor agrupación teniendo en cuenta que son 1001 empresas y se requiere agrupar de la mejor manera, sin embargo se observa que a partir de 12 clúster la disminución de la inercia se mantiene constante.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
clusterNum = 50
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12, max_iter=300)
k_means.fit(scores_pca)
labels = k_means.labels_
print(labels)
[47 47 47 ... 47 17 33]
# se crea la nueva variable con los cluster para cada empresa
pyclus["Cluster"] = labels
pyclus.head()
| Macrosector | Desc_Macrosector | CATEGORIA | Descripción | SUBCATEGORIA | NOMBRE | NIT | PERIODO | Ingresos_operacionales | Costo_ventas | ... | Margen_Neto | Activo | Pasivo | Patrimonio | Ebitda_ | Margen_Ebitda | Indice_endeudamiento | ROA | ROE | Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2016 | 310130 | 4077.0 | ... | 0.051446 | 3419876 | 2499818 | 920058 | 38816 | 0.125160 | 0.730967 | 0.004665 | 0.017341 | 47 |
| 1 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2017 | 300000 | 0.0 | ... | 0.120113 | 3766237 | 2810145 | 956092 | -70789 | -0.235963 | 0.746141 | 0.009568 | 0.037689 | 47 |
| 2 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2018 | 300000 | 0.0 | ... | 0.054290 | 5050305 | 4077926 | 972379 | -40105 | -0.133683 | 0.807461 | 0.003225 | 0.016750 | 47 |
| 3 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2019 | 300000 | 0.0 | ... | 0.231757 | 5013624 | 3971718 | 1041906 | 208181 | 0.693937 | 0.792185 | 0.013868 | 0.066731 | 47 |
| 4 | 1 | Sector agropecuario | A | AGRICULTURA, GANADERÍA, CAZA, SILVICULTURA Y P... | 145 | Cría de aves de corral | 800000276 | 2016 | 321430952 | 274957420.0 | ... | 0.000834 | 141586731 | 94570079 | 47016652 | 9848367 | 0.030639 | 0.667930 | 0.001894 | 0.005704 | 33 |
5 rows × 35 columns
# se ordenan los 50 clúster de acuerdo con su estadistica y percentiles
pymes_estadistica=pyclus.groupby('Cluster').describe(percentiles=[0.25,0.5,0.75,0.9])
pymes_estadistica
| Macrosector | SUBCATEGORIA | ... | ROA | ROE | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 25% | 50% | 75% | 90% | max | count | ... | max | count | mean | std | min | 25% | 50% | 75% | 90% | max | |
| Cluster | |||||||||||||||||||||
| 0 | 528.0 | 4.151515 | 2.236441 | 1.0 | 2.00 | 4.0 | 5.00 | 8.0 | 9.0 | 528.0 | ... | 0.168541 | 528.0 | -2.039612 | 2.946251 | -16.300286 | -2.518156 | -1.122819 | -0.391797 | -0.072437 | 8.061064 |
| 1 | 1.0 | 7.000000 | NaN | 7.0 | 7.00 | 7.0 | 7.00 | 7.0 | 7.0 | 1.0 | ... | -0.035849 | 1.0 | -0.239560 | NaN | -0.239560 | -0.239560 | -0.239560 | -0.239560 | -0.239560 | -0.239560 |
| 2 | 1.0 | 5.000000 | NaN | 5.0 | 5.00 | 5.0 | 5.00 | 5.0 | 5.0 | 1.0 | ... | -452.867472 | 1.0 | -2493.777899 | NaN | -2493.777899 | -2493.777899 | -2493.777899 | -2493.777899 | -2493.777899 | -2493.777899 |
| 3 | 1.0 | 2.000000 | NaN | 2.0 | 2.00 | 2.0 | 2.00 | 2.0 | 2.0 | 1.0 | ... | -0.305338 | 1.0 | -0.450605 | NaN | -0.450605 | -0.450605 | -0.450605 | -0.450605 | -0.450605 | -0.450605 |
| 4 | 4.0 | 1.000000 | 0.000000 | 1.0 | 1.00 | 1.0 | 1.00 | 1.0 | 1.0 | 4.0 | ... | -0.015352 | 4.0 | 0.007687 | 0.008215 | 0.000240 | 0.000914 | 0.007036 | 0.013810 | 0.015386 | 0.016437 |
| 5 | 1.0 | 8.000000 | NaN | 8.0 | 8.00 | 8.0 | 8.00 | 8.0 | 8.0 | 1.0 | ... | -0.098081 | 1.0 | -0.117968 | NaN | -0.117968 | -0.117968 | -0.117968 | -0.117968 | -0.117968 | -0.117968 |
| 6 | 1.0 | 8.000000 | NaN | 8.0 | 8.00 | 8.0 | 8.00 | 8.0 | 8.0 | 1.0 | ... | -0.006542 | 1.0 | -6.887051 | NaN | -6.887051 | -6.887051 | -6.887051 | -6.887051 | -6.887051 | -6.887051 |
| 7 | 2.0 | 4.500000 | 2.121320 | 3.0 | 3.75 | 4.5 | 5.25 | 5.7 | 6.0 | 2.0 | ... | -0.047120 | 2.0 | -0.322628 | 0.240122 | -0.492421 | -0.407524 | -0.322628 | -0.237732 | -0.186795 | -0.152836 |
| 8 | 1.0 | 2.000000 | NaN | 2.0 | 2.00 | 2.0 | 2.00 | 2.0 | 2.0 | 1.0 | ... | -0.014292 | 1.0 | -0.015169 | NaN | -0.015169 | -0.015169 | -0.015169 | -0.015169 | -0.015169 | -0.015169 |
| 9 | 1.0 | 3.000000 | NaN | 3.0 | 3.00 | 3.0 | 3.00 | 3.0 | 3.0 | 1.0 | ... | 0.291525 | 1.0 | -0.001840 | NaN | -0.001840 | -0.001840 | -0.001840 | -0.001840 | -0.001840 | -0.001840 |
| 10 | 12.0 | 4.750000 | 2.050499 | 2.0 | 2.75 | 5.0 | 7.00 | 7.0 | 7.0 | 12.0 | ... | 0.175532 | 12.0 | 0.050542 | 0.093350 | -0.016074 | -0.002507 | 0.008283 | 0.089306 | 0.100980 | 0.313998 |
| 11 | 3169.0 | 2.963395 | 1.930921 | 1.0 | 2.00 | 2.0 | 3.00 | 6.0 | 9.0 | 3169.0 | ... | 0.481673 | 3169.0 | 0.044024 | 0.108523 | -1.073842 | 0.002483 | 0.026112 | 0.076832 | 0.160127 | 0.833420 |
| 12 | 2.0 | 4.000000 | 2.828427 | 2.0 | 3.00 | 4.0 | 5.00 | 5.6 | 6.0 | 2.0 | ... | -0.012115 | 2.0 | 3.201346 | 4.545540 | -0.012836 | 1.594255 | 3.201346 | 4.808437 | 5.772692 | 6.415528 |
| 13 | 11.0 | 4.909091 | 2.625054 | 2.0 | 2.50 | 5.0 | 6.50 | 9.0 | 9.0 | 11.0 | ... | 9.177673 | 11.0 | 63.846372 | 66.004245 | 0.296902 | 0.890903 | 41.727343 | 119.719126 | 131.869131 | 163.077837 |
| 14 | 11.0 | 4.636364 | 2.802596 | 1.0 | 2.00 | 4.0 | 7.50 | 8.0 | 8.0 | 11.0 | ... | 0.007223 | 11.0 | -0.276117 | 0.635621 | -2.155735 | -0.149300 | -0.063207 | -0.013932 | 0.007504 | 0.028843 |
| 15 | 9.0 | 5.333333 | 2.121320 | 2.0 | 4.00 | 5.0 | 7.00 | 8.0 | 8.0 | 9.0 | ... | -0.034001 | 9.0 | -106.782380 | 88.543219 | -244.476096 | -153.897361 | -140.029146 | -18.219403 | -2.064925 | -0.526515 |
| 16 | 1.0 | 7.000000 | NaN | 7.0 | 7.00 | 7.0 | 7.00 | 7.0 | 7.0 | 1.0 | ... | -0.020763 | 1.0 | -0.021093 | NaN | -0.021093 | -0.021093 | -0.021093 | -0.021093 | -0.021093 | -0.021093 |
| 17 | 7584.0 | 4.380142 | 1.773434 | 1.0 | 3.00 | 5.0 | 5.00 | 7.0 | 9.0 | 7584.0 | ... | 0.336182 | 7584.0 | 0.092393 | 0.150921 | -1.099257 | 0.022571 | 0.080243 | 0.162114 | 0.257255 | 1.011536 |
| 18 | 1.0 | 2.000000 | NaN | 2.0 | 2.00 | 2.0 | 2.00 | 2.0 | 2.0 | 1.0 | ... | -0.659499 | 1.0 | 0.020403 | NaN | 0.020403 | 0.020403 | 0.020403 | 0.020403 | 0.020403 | 0.020403 |
| 19 | 3.0 | 3.666667 | 3.785939 | 1.0 | 1.50 | 2.0 | 5.00 | 6.8 | 8.0 | 3.0 | ... | 0.079261 | 3.0 | -0.037664 | 0.403501 | -0.455053 | -0.231667 | -0.008282 | 0.171030 | 0.278618 | 0.350342 |
| 20 | 1.0 | 8.000000 | NaN | 8.0 | 8.00 | 8.0 | 8.00 | 8.0 | 8.0 | 1.0 | ... | 106.306635 | 1.0 | -3.623118 | NaN | -3.623118 | -3.623118 | -3.623118 | -3.623118 | -3.623118 | -3.623118 |
| 21 | 122.0 | 3.729508 | 2.104842 | 1.0 | 2.00 | 3.0 | 5.00 | 7.0 | 9.0 | 122.0 | ... | 2.837710 | 122.0 | 2.880191 | 9.145976 | 0.009354 | 0.182893 | 0.326627 | 0.624774 | 1.541993 | 48.506863 |
| 22 | 58.0 | 3.948276 | 1.771335 | 2.0 | 2.00 | 5.0 | 5.00 | 6.0 | 9.0 | 58.0 | ... | 0.462602 | 58.0 | 0.257645 | 0.388949 | -0.161899 | 0.018026 | 0.068947 | 0.346849 | 0.887230 | 1.508917 |
| 23 | 2.0 | 4.500000 | 3.535534 | 2.0 | 3.25 | 4.5 | 5.75 | 6.5 | 7.0 | 2.0 | ... | 0.001216 | 2.0 | 0.001761 | 0.001112 | 0.000975 | 0.001368 | 0.001761 | 0.002154 | 0.002390 | 0.002547 |
| 24 | 1.0 | 5.000000 | NaN | 5.0 | 5.00 | 5.0 | 5.00 | 5.0 | 5.0 | 1.0 | ... | -0.023599 | 1.0 | -0.051663 | NaN | -0.051663 | -0.051663 | -0.051663 | -0.051663 | -0.051663 | -0.051663 |
| 25 | 7582.0 | 4.192429 | 1.900515 | 1.0 | 3.00 | 5.0 | 5.00 | 7.0 | 9.0 | 7582.0 | ... | 0.571639 | 7582.0 | 0.079930 | 0.107886 | -0.988204 | 0.020394 | 0.065949 | 0.130584 | 0.205371 | 0.864971 |
| 26 | 1.0 | 1.000000 | NaN | 1.0 | 1.00 | 1.0 | 1.00 | 1.0 | 1.0 | 1.0 | ... | -0.041574 | 1.0 | -0.064342 | NaN | -0.064342 | -0.064342 | -0.064342 | -0.064342 | -0.064342 | -0.064342 |
| 27 | 5.0 | 5.600000 | 3.361547 | 2.0 | 2.00 | 7.0 | 8.00 | 8.6 | 9.0 | 5.0 | ... | 32.210114 | 5.0 | -2.268526 | 5.938144 | -12.870145 | 0.016999 | 0.052323 | 0.541422 | 0.766631 | 0.916771 |
| 28 | 3.0 | 5.000000 | 3.000000 | 2.0 | 3.50 | 5.0 | 6.50 | 7.4 | 8.0 | 3.0 | ... | 0.033225 | 3.0 | -60.500071 | 102.741232 | -179.126854 | -90.774359 | -2.421863 | -1.186680 | -0.445570 | 0.048503 |
| 29 | 386.0 | 4.119171 | 2.111513 | 1.0 | 2.00 | 4.0 | 5.00 | 7.0 | 9.0 | 386.0 | ... | 2.024877 | 386.0 | 0.479891 | 0.918603 | -5.224853 | 0.000000 | 0.237432 | 0.817052 | 1.428359 | 6.486934 |
| 30 | 23.0 | 5.043478 | 2.305690 | 2.0 | 2.50 | 5.0 | 7.00 | 8.0 | 9.0 | 23.0 | ... | 0.366194 | 23.0 | -33.251246 | 41.910448 | -116.285320 | -66.688435 | -1.372963 | 0.564466 | 1.907441 | 2.945604 |
| 31 | 1.0 | 3.000000 | NaN | 3.0 | 3.00 | 3.0 | 3.00 | 3.0 | 3.0 | 1.0 | ... | -0.028487 | 1.0 | -0.102355 | NaN | -0.102355 | -0.102355 | -0.102355 | -0.102355 | -0.102355 | -0.102355 |
| 32 | 7.0 | 5.142857 | 3.236694 | 2.0 | 2.00 | 5.0 | 8.00 | 9.0 | 9.0 | 7.0 | ... | 0.250751 | 7.0 | 0.031426 | 0.040333 | -0.013070 | 0.003740 | 0.030779 | 0.043273 | 0.071781 | 0.108247 |
| 33 | 1953.0 | 4.444956 | 1.954442 | 1.0 | 3.00 | 5.0 | 5.00 | 7.0 | 9.0 | 1953.0 | ... | 0.733251 | 1953.0 | 0.462546 | 1.286958 | -11.668540 | 0.140413 | 0.251427 | 0.446460 | 0.721625 | 15.406677 |
| 34 | 2.0 | 4.500000 | 3.535534 | 2.0 | 3.25 | 4.5 | 5.75 | 6.5 | 7.0 | 2.0 | ... | 0.001130 | 2.0 | -0.006056 | 0.010996 | -0.013831 | -0.009944 | -0.006056 | -0.002168 | 0.000165 | 0.001720 |
| 35 | 480.0 | 4.160417 | 2.206769 | 1.0 | 2.00 | 4.0 | 6.00 | 7.0 | 9.0 | 480.0 | ... | 3.282360 | 480.0 | 0.711030 | 2.371763 | -0.153942 | 0.138948 | 0.294350 | 0.542750 | 0.859271 | 23.581120 |
| 36 | 29.0 | 4.275862 | 2.950745 | 1.0 | 2.00 | 2.0 | 7.00 | 8.0 | 9.0 | 29.0 | ... | 0.176179 | 29.0 | -0.237108 | 0.974108 | -3.565672 | -0.308527 | -0.068566 | 0.059448 | 0.426563 | 1.426098 |
| 37 | 37.0 | 3.864865 | 2.070456 | 1.0 | 2.00 | 3.0 | 5.00 | 7.0 | 8.0 | 37.0 | ... | 0.951975 | 37.0 | 16.266388 | 31.247507 | -7.890747 | 0.336725 | 0.798079 | 0.996601 | 76.667524 | 85.612208 |
| 38 | 3.0 | 5.666667 | 3.214550 | 2.0 | 4.50 | 7.0 | 7.50 | 7.8 | 8.0 | 3.0 | ... | 0.063833 | 3.0 | -372.266847 | 31.130793 | -404.111111 | -387.449121 | -370.787130 | -356.344714 | -347.679265 | -341.902299 |
| 39 | 4698.0 | 3.699021 | 2.076072 | 1.0 | 2.00 | 3.0 | 5.00 | 7.0 | 9.0 | 4698.0 | ... | 0.344789 | 4698.0 | 0.032450 | 0.101769 | -1.626971 | -0.000081 | 0.023107 | 0.073767 | 0.131415 | 0.428196 |
| 40 | 1.0 | 7.000000 | NaN | 7.0 | 7.00 | 7.0 | 7.00 | 7.0 | 7.0 | 1.0 | ... | -32.112300 | 1.0 | 1.059800 | NaN | 1.059800 | 1.059800 | 1.059800 | 1.059800 | 1.059800 | 1.059800 |
| 41 | 3.0 | 6.000000 | 2.645751 | 3.0 | 5.00 | 7.0 | 7.50 | 7.8 | 8.0 | 3.0 | ... | 0.859077 | 3.0 | 0.614455 | 0.656031 | 0.016529 | 0.263578 | 0.510628 | 0.913418 | 1.155092 | 1.316207 |
| 42 | 5022.0 | 4.639188 | 2.038086 | 1.0 | 3.00 | 5.0 | 7.00 | 7.0 | 9.0 | 5022.0 | ... | 0.421395 | 5022.0 | 0.136594 | 0.862573 | -6.045956 | -0.022654 | 0.080247 | 0.256599 | 0.586124 | 11.180883 |
| 43 | 4200.0 | 2.831429 | 1.785113 | 1.0 | 2.00 | 2.0 | 2.00 | 6.0 | 9.0 | 4200.0 | ... | 0.718796 | 4200.0 | 0.024045 | 0.084178 | -1.518815 | 0.000938 | 0.016173 | 0.044535 | 0.097629 | 0.795496 |
| 44 | 15.0 | 4.733333 | 2.313521 | 2.0 | 2.50 | 5.0 | 7.00 | 7.0 | 9.0 | 15.0 | ... | 0.057644 | 15.0 | 0.288547 | 0.515313 | -0.007323 | 0.002835 | 0.050508 | 0.293577 | 1.110379 | 1.493469 |
| 45 | 119.0 | 4.554622 | 2.338685 | 1.0 | 2.00 | 5.0 | 6.00 | 8.0 | 9.0 | 119.0 | ... | 0.290848 | 119.0 | -6.889925 | 10.929489 | -42.383965 | -10.456302 | -1.192195 | -0.207421 | 0.456163 | 13.922424 |
| 46 | 2.0 | 9.000000 | 0.000000 | 9.0 | 9.00 | 9.0 | 9.00 | 9.0 | 9.0 | 2.0 | ... | 0.000000 | 2.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 47 | 2250.0 | 3.248444 | 2.046385 | 1.0 | 2.00 | 2.0 | 5.00 | 7.0 | 9.0 | 2250.0 | ... | 0.406332 | 2250.0 | 0.085106 | 0.212268 | -1.363596 | 0.006584 | 0.053625 | 0.155008 | 0.321747 | 2.302458 |
| 48 | 125.0 | 3.736000 | 2.024623 | 1.0 | 2.00 | 3.0 | 5.00 | 7.0 | 9.0 | 125.0 | ... | 0.738013 | 125.0 | 0.308473 | 0.456927 | -0.596029 | 0.028864 | 0.151513 | 0.421265 | 0.902953 | 2.295235 |
| 49 | 1529.0 | 3.926095 | 2.191525 | 1.0 | 2.00 | 3.0 | 5.00 | 7.0 | 9.0 | 1529.0 | ... | 1.239556 | 1529.0 | 0.241156 | 0.192613 | -0.033010 | 0.115499 | 0.192705 | 0.309999 | 0.467289 | 2.009724 |
50 rows × 270 columns
# se guarda la nueva base de datos con la variable cluster usando el metodo kmeans
pymes_estadistica.to_excel('pymes_estadistica_kmeans_sratios.xlsx', sheet_name='kmeans')
# se copia la base de datos para insertar las 5 componentes principales en la base de datos
py_pca=pyclus.copy()
df_pca_cluster=pd.concat([py_pca.reset_index(drop=True),pd.DataFrame(scores_pca)],axis=1)
# se crean las 5 variables con las 5 componentes principales
df_pca_cluster.columns.values[-5:]=["PCA_1","PCA_2","PCA_3","PCA_4","PCA_5"]
df_pca_cluster.head()
| Macrosector | Desc_Macrosector | CATEGORIA | Descripción | SUBCATEGORIA | NOMBRE | NIT | PERIODO | Ingresos_operacionales | Costo_ventas | ... | Margen_Ebitda | Indice_endeudamiento | ROA | ROE | Cluster | PCA_1 | PCA_2 | PCA_3 | PCA_4 | PCA_5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2016 | 310130 | 4077.0 | ... | 0.125160 | 0.730967 | 0.004665 | 0.017341 | 47 | -0.020810 | 0.043945 | 0.012007 | -0.216527 | 0.001229 |
| 1 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2017 | 300000 | 0.0 | ... | -0.235963 | 0.746141 | 0.009568 | 0.037689 | 47 | -0.020790 | 0.046819 | 0.018757 | -0.227293 | 0.001130 |
| 2 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2018 | 300000 | 0.0 | ... | -0.133683 | 0.807461 | 0.003225 | 0.016750 | 47 | -0.020823 | 0.048073 | 0.053170 | -0.261793 | 0.001210 |
| 3 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2019 | 300000 | 0.0 | ... | 0.693937 | 0.792185 | 0.013868 | 0.066731 | 47 | -0.021013 | 0.031303 | 0.043883 | -0.253171 | 0.001417 |
| 4 | 1 | Sector agropecuario | A | AGRICULTURA, GANADERÍA, CAZA, SILVICULTURA Y P... | 145 | Cría de aves de corral | 800000276 | 2016 | 321430952 | 274957420.0 | ... | 0.030639 | 0.667930 | 0.001894 | 0.005704 | 33 | -0.023233 | -0.412560 | 0.087197 | -0.037415 | 0.008790 |
5 rows × 40 columns
# tipos de variables con la información completa de variables, cluster k means y componentes principales
df_pca_cluster.dtypes
Macrosector int64 Desc_Macrosector object CATEGORIA object Descripción object SUBCATEGORIA int64 NOMBRE object NIT int64 PERIODO int64 Ingresos_operacionales int64 Costo_ventas float64 Utilidad_bruta int64 Margen_Bruto float64 Otros_ingresos int64 Costos_distribución float64 Gastos_admon float64 Otros_gastos float64 Otras_ganancias_operdidas float64 Utilidad_operativa int64 Margen_operativo float64 Ing_financieros float64 Cto_financieros float64 Utilidad_antes_impuestos int64 Margen_antes_impuestos float64 Gasto_impuestos float64 Utilidad_Neta int64 Margen_Neto float64 Activo int64 Pasivo int64 Patrimonio int64 Ebitda_ int64 Margen_Ebitda float64 Indice_endeudamiento float64 ROA float64 ROE float64 Cluster int32 PCA_1 float64 PCA_2 float64 PCA_3 float64 PCA_4 float64 PCA_5 float64 dtype: object
df_pca_cluster.columns
Index(['Macrosector', 'Desc_Macrosector', 'CATEGORIA', 'Descripción',
'SUBCATEGORIA', 'NOMBRE', 'NIT', 'PERIODO', 'Ingresos_operacionales',
'Costo_ventas', 'Utilidad_bruta', 'Margen_Bruto', 'Otros_ingresos',
'Costos_distribución', 'Gastos_admon', 'Otros_gastos',
'Otras_ganancias_operdidas', 'Utilidad_operativa', 'Margen_operativo',
'Ing_financieros', 'Cto_financieros', 'Utilidad_antes_impuestos',
'Margen_antes_impuestos', 'Gasto_impuestos', 'Utilidad_Neta',
'Margen_Neto', 'Activo', 'Pasivo', 'Patrimonio', 'Ebitda_',
'Margen_Ebitda', 'Indice_endeudamiento', 'ROA', 'ROE', 'Cluster',
'PCA_1', 'PCA_2', 'PCA_3', 'PCA_4', 'PCA_5'],
dtype='object')
norm
['Margen_Bruto', 'Margen_operativo', 'Margen_antes_impuestos', 'Margen_Neto', 'Ebitda_', 'Margen_Ebitda', 'Indice_endeudamiento', 'ROA', 'ROE']
norm2=['Margen_Bruto',
'Margen_operativo',
'Margen_antes_impuestos',
'Margen_Neto',
'Ebitda_',
'Margen_Ebitda',
'Indice_endeudamiento',
'ROA',
'ROE','PCA_1',
'PCA_2', 'PCA_3', 'PCA_4', 'PCA_5']
# matriz de correlación de componentes principales
corr_mat2 =df_pca_cluster[norm2].corr()
for x in range(len(norm2)):
corr_mat2.iloc[x,x] = 0.0
corr_mat2.iloc[9:,:9]
| Margen_Bruto | Margen_operativo | Margen_antes_impuestos | Margen_Neto | Ebitda_ | Margen_Ebitda | Indice_endeudamiento | ROA | ROE | |
|---|---|---|---|---|---|---|---|---|---|
| PCA_1 | 0.000672 | -0.975280 | -0.992726 | -0.633725 | -0.004475 | -0.939319 | -0.001317 | -0.002958 | -0.002114 |
| PCA_2 | -0.009947 | 0.002406 | 0.002458 | 0.001611 | -0.717757 | 0.002306 | 0.001138 | -0.944668 | -0.934500 |
| PCA_3 | -0.707698 | -0.001009 | -0.000911 | 0.001217 | -0.033280 | -0.000248 | 0.710331 | 0.019998 | 0.013740 |
| PCA_4 | -0.706044 | 0.000101 | 0.000267 | 0.000598 | 0.000213 | -0.000325 | -0.703537 | -0.000750 | 0.007254 |
| PCA_5 | -0.001460 | 0.186408 | 0.046866 | -0.772538 | 0.006333 | 0.278110 | 0.000703 | -0.002327 | -0.002537 |
f, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(corr_mat2.iloc[9:,:9],square=True,cbar = True, annot = True, annot_kws = {'size': 10} );
# Cluster y cantidad de periodos por cada nit
Paglo_cluster=df_pca_cluster[["NIT","Cluster","PERIODO"]].groupby(["NIT","Cluster"],as_index=False).count()
Paglo_cluster.rename(columns={'PERIODO':'CANTIDAD_PERIODOS'},
inplace=True)
Paglo_cluster.tail(30)
| NIT | Cluster | CANTIDAD_PERIODOS | |
|---|---|---|---|
| 15391 | 901003985 | 0 | 2 |
| 15392 | 901003985 | 9 | 2 |
| 15393 | 901014567 | 9 | 4 |
| 15394 | 901033073 | 0 | 1 |
| 15395 | 901033073 | 28 | 1 |
| 15396 | 901033073 | 39 | 2 |
| 15397 | 901033282 | 0 | 2 |
| 15398 | 901033282 | 9 | 1 |
| 15399 | 901033282 | 29 | 1 |
| 15400 | 901033316 | 0 | 3 |
| 15401 | 901033316 | 9 | 1 |
| 15402 | 901034604 | 9 | 2 |
| 15403 | 901034604 | 36 | 2 |
| 15404 | 901035348 | 0 | 1 |
| 15405 | 901035348 | 28 | 1 |
| 15406 | 901035348 | 39 | 2 |
| 15407 | 901035576 | 18 | 3 |
| 15408 | 901035576 | 32 | 1 |
| 15409 | 901035582 | 9 | 3 |
| 15410 | 901035582 | 22 | 1 |
| 15411 | 901038787 | 18 | 3 |
| 15412 | 901038787 | 28 | 1 |
| 15413 | 901038968 | 9 | 4 |
| 15414 | 901042983 | 18 | 4 |
| 15415 | 901061715 | 9 | 4 |
| 15416 | 901085146 | 9 | 1 |
| 15417 | 901085146 | 46 | 3 |
| 15418 | 901107672 | 0 | 1 |
| 15419 | 901107672 | 9 | 2 |
| 15420 | 901107672 | 29 | 1 |
pyclus.columns
Index(['Macrosector', 'Desc_Macrosector', 'CATEGORIA', 'Descripción',
'SUBCATEGORIA', 'NOMBRE', 'NIT', 'PERIODO', 'Ingresos_operacionales',
'Costo_ventas', 'Utilidad_bruta', 'Margen_Bruto', 'Otros_ingresos',
'Costos_distribución', 'Gastos_admon', 'Otros_gastos',
'Otras_ganancias_operdidas', 'Utilidad_operativa', 'Margen_operativo',
'Ing_financieros', 'Cto_financieros', 'Utilidad_antes_impuestos',
'Margen_antes_impuestos', 'Gasto_impuestos', 'Utilidad_Neta',
'Margen_Neto', 'Activo', 'Pasivo', 'Patrimonio', 'Ebitda_',
'Margen_Ebitda', 'Indice_endeudamiento', 'ROA', 'ROE', 'Cluster'],
dtype='object')
norm3=["Cluster",'Ingresos_operacionales',
'Costo_ventas', 'Utilidad_bruta', 'Margen_Bruto', 'Otros_ingresos',
'Costos_distribución', 'Gastos_admon', 'Otros_gastos',
'Otras_ganancias_operdidas', 'Utilidad_operativa', 'Margen_operativo',
'Ing_financieros', 'Cto_financieros', 'Utilidad_antes_impuestos',
'Margen_antes_impuestos', 'Gasto_impuestos', 'Utilidad_Neta',
'Margen_Neto', 'Activo', 'Pasivo', 'Patrimonio', 'Ebitda_',
'Margen_Ebitda', 'Indice_endeudamiento', 'ROA', 'ROE','PCA_1',
'PCA_2', 'PCA_3', 'PCA_4', 'PCA_5']
# Estadistica y percentiles para los 50 cluster y sus componentes principales
pymes_estadistica_clus=df_pca_cluster[norm3].groupby('Cluster').describe(percentiles=[0.75,0.9])
pymes_estadistica_clus
| Ingresos_operacionales | Costo_ventas | ... | PCA_4 | PCA_5 | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 50% | 75% | 90% | max | count | mean | ... | 90% | max | count | mean | std | min | 50% | 75% | 90% | max | |
| Cluster | |||||||||||||||||||||
| 0 | 528.0 | 2.229837e+07 | 4.918107e+07 | 0.0 | 8260392.0 | 2.170017e+07 | 5.150605e+07 | 6.393797e+08 | 528.0 | 1.626442e+07 | ... | 0.097170 | 0.470432 | 528.0 | -0.000960 | 0.003543 | -0.018576 | -0.000204 | 0.001522 | 0.002708 | 0.005845 |
| 1 | 1.0 | 1.000000e+00 | NaN | 1.0 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.0 | 1.000000e+00 | ... | -0.126627 | -0.126627 | 1.0 | -53.892215 | NaN | -53.892215 | -53.892215 | -53.892215 | -53.892215 | -53.892215 |
| 2 | 1.0 | 1.566311e+07 | NaN | 15663111.0 | 15663111.0 | 1.566311e+07 | 1.566311e+07 | 1.566311e+07 | 1.0 | 1.342816e+07 | ... | -1.362815 | -1.362815 | 1.0 | 0.352298 | NaN | 0.352298 | 0.352298 | 0.352298 | 0.352298 | 0.352298 |
| 3 | 1.0 | 1.000000e+00 | NaN | 1.0 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.0 | 0.000000e+00 | ... | -0.124460 | -0.124460 | 1.0 | 136.261960 | NaN | 136.261960 | 136.261960 | 136.261960 | 136.261960 | 136.261960 |
| 4 | 4.0 | 0.000000e+00 | 0.000000e+00 | 0.0 | 0.0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.0 | 0.000000e+00 | ... | -36.252707 | -36.240382 | 4.0 | 0.045463 | 0.001214 | 0.044361 | 0.045413 | 0.046430 | 0.046571 | 0.046666 |
| 5 | 1.0 | 7.099000e+03 | NaN | 7099.0 | 7099.0 | 7.099000e+03 | 7.099000e+03 | 7.099000e+03 | 1.0 | 4.912179e+06 | ... | 117.949589 | 117.949589 | 1.0 | 0.209220 | NaN | 0.209220 | 0.209220 | 0.209220 | 0.209220 | 0.209220 |
| 6 | 1.0 | 3.300000e+02 | NaN | 330.0 | 330.0 | 3.300000e+02 | 3.300000e+02 | 3.300000e+02 | 1.0 | 7.522800e+04 | ... | 38.452993 | 38.452993 | 1.0 | 0.069903 | NaN | 0.069903 | 0.069903 | 0.069903 | 0.069903 | 0.069903 |
| 7 | 2.0 | 1.132148e+07 | 9.545794e+06 | 4571580.0 | 11321475.5 | 1.469642e+07 | 1.672139e+07 | 1.807137e+07 | 2.0 | 7.452510e+06 | ... | -0.081650 | -0.067725 | 2.0 | -0.424560 | 0.031611 | -0.446912 | -0.424560 | -0.413383 | -0.406678 | -0.402207 |
| 8 | 1.0 | 1.000000e+00 | NaN | 1.0 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.0 | 0.000000e+00 | ... | 0.164727 | 0.164727 | 1.0 | -29.876716 | NaN | -29.876716 | -29.876716 | -29.876716 | -29.876716 | -29.876716 |
| 9 | 1.0 | 4.054500e+04 | NaN | 40545.0 | 40545.0 | 4.054500e+04 | 4.054500e+04 | 4.054500e+04 | 1.0 | 1.003700e+04 | ... | -89.467202 | -89.467202 | 1.0 | 0.106389 | NaN | 0.106389 | 0.106389 | 0.106389 | 0.106389 | 0.106389 |
| 10 | 12.0 | 1.442477e+05 | 2.744580e+05 | 0.0 | 0.0 | 9.007375e+04 | 6.417679e+05 | 7.459900e+05 | 12.0 | 9.119833e+04 | ... | -6.369536 | -6.116882 | 12.0 | 0.010789 | 0.001952 | 0.008686 | 0.010351 | 0.011319 | 0.011715 | 0.016168 |
| 11 | 3169.0 | 3.597436e+06 | 6.138097e+06 | 432.0 | 1417500.0 | 4.083792e+06 | 9.533026e+06 | 8.986691e+07 | 3169.0 | 5.402153e+05 | ... | 0.074708 | 0.125655 | 3169.0 | 0.001249 | 0.001550 | -0.067881 | 0.001179 | 0.001599 | 0.002158 | 0.017512 |
| 12 | 2.0 | 6.000000e+00 | 7.071068e+00 | 1.0 | 6.0 | 8.500000e+00 | 1.000000e+01 | 1.100000e+01 | 2.0 | 0.000000e+00 | ... | 0.103227 | 0.156960 | 2.0 | -12.331767 | 1.394867 | -13.318087 | -12.331767 | -11.838607 | -11.542711 | -11.345447 |
| 13 | 11.0 | 5.274738e+08 | 1.509895e+09 | 0.0 | 17680168.0 | 7.501420e+07 | 5.148430e+08 | 5.057506e+09 | 11.0 | 3.656019e+08 | ... | 0.028196 | 0.194452 | 11.0 | 0.024395 | 0.053297 | -0.035565 | 0.023726 | 0.075579 | 0.086426 | 0.089440 |
| 14 | 11.0 | 5.496647e+05 | 1.150660e+06 | 500.0 | 37608.0 | 2.766895e+05 | 1.585133e+06 | 3.715882e+06 | 11.0 | 1.640991e+07 | ... | 9.578915 | 9.789944 | 11.0 | 0.004652 | 0.021862 | -0.039562 | 0.012427 | 0.016201 | 0.020002 | 0.024057 |
| 15 | 9.0 | 8.368928e+06 | 1.198719e+07 | 0.0 | 3516915.0 | 9.977017e+06 | 2.785120e+07 | 3.004307e+07 | 9.0 | 1.624583e+07 | ... | 0.443296 | 0.933976 | 9.0 | -0.023832 | 0.090715 | -0.155046 | 0.031730 | 0.036089 | 0.043083 | 0.055827 |
| 16 | 1.0 | 1.000000e+00 | NaN | 1.0 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.0 | 0.000000e+00 | ... | 0.196971 | 0.196971 | 1.0 | -46.772675 | NaN | -46.772675 | -46.772675 | -46.772675 | -46.772675 | -46.772675 |
| 17 | 7584.0 | 2.036908e+07 | 2.593279e+07 | 0.0 | 12624997.5 | 2.294116e+07 | 4.398682e+07 | 3.588621e+08 | 7584.0 | 1.625774e+07 | ... | 0.008459 | 0.710820 | 7584.0 | 0.001916 | 0.000728 | -0.002458 | 0.001805 | 0.002226 | 0.002752 | 0.034179 |
| 18 | 1.0 | 3.866330e+05 | NaN | 386633.0 | 386633.0 | 3.866330e+05 | 3.866330e+05 | 3.866330e+05 | 1.0 | 2.255800e+05 | ... | -18.451592 | -18.451592 | 1.0 | 0.023295 | NaN | 0.023295 | 0.023295 | 0.023295 | 0.023295 | 0.023295 |
| 19 | 3.0 | 3.603733e+04 | 3.182126e+04 | 500.0 | 45718.0 | 5.380600e+04 | 5.865880e+04 | 6.189400e+04 | 3.0 | 2.991908e+06 | ... | 16.086244 | 16.426443 | 3.0 | 0.041322 | 0.014787 | 0.027543 | 0.039480 | 0.048212 | 0.053451 | 0.056944 |
| 20 | 1.0 | 7.948424e+07 | NaN | 79484241.0 | 79484241.0 | 7.948424e+07 | 7.948424e+07 | 7.948424e+07 | 1.0 | 2.354660e+07 | ... | -16.858341 | -16.858341 | 1.0 | -0.078595 | NaN | -0.078595 | -0.078595 | -0.078595 | -0.078595 | -0.078595 |
| 21 | 122.0 | 1.097347e+08 | 1.552832e+08 | 0.0 | 44537363.5 | 1.440166e+08 | 3.371719e+08 | 8.312287e+08 | 122.0 | 6.968243e+07 | ... | 0.191221 | 0.338190 | 122.0 | 0.026729 | 0.026277 | -0.014223 | 0.024232 | 0.029037 | 0.035294 | 0.226995 |
| 22 | 58.0 | 3.386742e+06 | 1.078152e+07 | 0.0 | 252303.5 | 1.791099e+06 | 8.070874e+06 | 7.620913e+07 | 58.0 | 2.613058e+06 | ... | -1.673889 | -1.481959 | 58.0 | 0.001421 | 0.012113 | -0.063234 | 0.004024 | 0.004449 | 0.005364 | 0.007501 |
| 23 | 2.0 | 1.000000e+00 | 0.000000e+00 | 1.0 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.0 | 0.000000e+00 | ... | 0.134382 | 0.160540 | 2.0 | 4.101397 | 0.632300 | 3.654294 | 4.101397 | 4.324949 | 4.459080 | 4.548500 |
| 24 | 1.0 | 1.000000e+00 | NaN | 1.0 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.0 | 0.000000e+00 | ... | -0.128744 | -0.128744 | 1.0 | -59.297911 | NaN | -59.297911 | -59.297911 | -59.297911 | -59.297911 | -59.297911 |
| 25 | 7582.0 | 1.526769e+07 | 1.975023e+07 | 0.0 | 10221239.5 | 1.783942e+07 | 3.065473e+07 | 4.766627e+08 | 7582.0 | 1.170135e+07 | ... | 0.137975 | 0.735789 | 7582.0 | 0.001761 | 0.001025 | -0.066683 | 0.001647 | 0.002070 | 0.002630 | 0.008907 |
| 26 | 1.0 | 5.000000e+02 | NaN | 500.0 | 500.0 | 5.000000e+02 | 5.000000e+02 | 5.000000e+02 | 1.0 | 1.500130e+05 | ... | 51.093162 | 51.093162 | 1.0 | 0.121768 | NaN | 0.121768 | 0.121768 | 0.121768 | 0.121768 | 0.121768 |
| 27 | 5.0 | 3.284879e+09 | 4.483398e+09 | 10217838.0 | 90319862.0 | 7.316126e+09 | 8.320491e+09 | 8.990068e+09 | 5.0 | 1.582144e+09 | ... | 0.160146 | 0.254420 | 5.0 | 0.138149 | 0.066231 | 0.020335 | 0.164548 | 0.169440 | 0.174434 | 0.177763 |
| 28 | 3.0 | 1.565475e+09 | 2.667018e+09 | 16288885.0 | 35072677.0 | 2.340068e+09 | 3.723065e+09 | 4.645063e+09 | 3.0 | 1.342187e+09 | ... | 1.635683 | 2.027757 | 3.0 | -0.400710 | 0.272995 | -0.609188 | -0.501237 | -0.296472 | -0.173612 | -0.091706 |
| 29 | 386.0 | 7.686875e+06 | 1.796829e+07 | 0.0 | 3169896.0 | 8.107447e+06 | 1.567682e+07 | 2.366876e+08 | 386.0 | 5.698031e+06 | ... | -0.316803 | 0.124945 | 386.0 | 0.000764 | 0.005906 | -0.108878 | 0.001708 | 0.002015 | 0.002262 | 0.007324 |
| 30 | 23.0 | 3.996164e+07 | 5.317572e+07 | 0.0 | 5527995.0 | 6.705837e+07 | 1.165965e+08 | 1.757003e+08 | 23.0 | 2.628312e+07 | ... | 0.128498 | 0.301979 | 23.0 | -0.058245 | 0.207822 | -0.998931 | -0.032594 | 0.016993 | 0.022150 | 0.027419 |
| 31 | 1.0 | 5.000000e+02 | NaN | 500.0 | 500.0 | 5.000000e+02 | 5.000000e+02 | 5.000000e+02 | 1.0 | 7.682000e+04 | ... | 25.954058 | 25.954058 | 1.0 | 0.041418 | NaN | 0.041418 | 0.041418 | 0.041418 | 0.041418 | 0.041418 |
| 32 | 7.0 | 4.876713e+05 | 6.856835e+05 | 0.0 | 134000.0 | 7.106180e+05 | 1.201354e+06 | 1.858013e+06 | 7.0 | 2.342844e+05 | ... | -10.306447 | -9.704090 | 7.0 | 0.014355 | 0.003603 | 0.006678 | 0.015579 | 0.015919 | 0.016888 | 0.017842 |
| 33 | 1953.0 | 5.878685e+07 | 6.059761e+07 | 0.0 | 36362426.0 | 7.703930e+07 | 1.385746e+08 | 4.782311e+08 | 1953.0 | 4.399617e+07 | ... | 0.036691 | 0.340840 | 1953.0 | 0.004866 | 0.005380 | -0.004735 | 0.004464 | 0.005566 | 0.007229 | 0.227805 |
| 34 | 2.0 | 1.000000e+00 | 0.000000e+00 | 1.0 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.0 | 0.000000e+00 | ... | -0.006688 | 0.001704 | 2.0 | -3.967691 | 1.796897 | -5.238289 | -3.967691 | -3.332392 | -2.951212 | -2.697093 |
| 35 | 480.0 | 7.947728e+07 | 1.246469e+08 | 0.0 | 42440078.0 | 9.863794e+07 | 1.721347e+08 | 1.093382e+09 | 480.0 | 5.399060e+07 | ... | 0.191647 | 0.508223 | 480.0 | 0.012047 | 0.007541 | -0.003957 | 0.011083 | 0.013603 | 0.016280 | 0.100375 |
| 36 | 29.0 | 3.244788e+05 | 5.180651e+05 | 1722.0 | 64561.0 | 2.918830e+05 | 1.065416e+06 | 2.126292e+06 | 29.0 | 3.457109e+06 | ... | 2.573542 | 3.843592 | 29.0 | -0.000172 | 0.009172 | -0.031647 | 0.001089 | 0.004095 | 0.008054 | 0.015312 |
| 37 | 37.0 | 3.278744e+08 | 1.006483e+09 | 0.0 | 18906605.0 | 1.851708e+08 | 6.724267e+08 | 6.035022e+09 | 37.0 | 2.465122e+08 | ... | 0.218566 | 0.309543 | 37.0 | 0.037622 | 0.030915 | -0.021096 | 0.045354 | 0.056578 | 0.067360 | 0.091432 |
| 38 | 3.0 | 1.630902e+06 | 2.733514e+06 | 51325.0 | 54089.0 | 2.420690e+06 | 3.840651e+06 | 4.787291e+06 | 3.0 | 3.381558e+06 | ... | -0.280979 | -0.211948 | 3.0 | 0.084020 | 0.007777 | 0.078814 | 0.080287 | 0.086623 | 0.090425 | 0.092960 |
| 39 | 4698.0 | 7.828213e+06 | 1.155192e+07 | 0.0 | 5618936.0 | 1.055182e+07 | 1.750919e+07 | 3.099623e+08 | 4698.0 | 5.953516e+06 | ... | 0.338717 | 0.879811 | 4698.0 | 0.001344 | 0.001365 | -0.024904 | 0.001292 | 0.001613 | 0.002093 | 0.068250 |
| 40 | 1.0 | 0.000000e+00 | NaN | 0.0 | 0.0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.0 | 0.000000e+00 | ... | -17.232112 | -17.232112 | 1.0 | 0.056317 | NaN | 0.056317 | 0.056317 | 0.056317 | 0.056317 | 0.056317 |
| 41 | 3.0 | 2.162574e+09 | 3.622069e+09 | 0.0 | 143563931.0 | 3.243861e+09 | 5.104038e+09 | 6.344157e+09 | 3.0 | 1.501287e+09 | ... | 0.245442 | 0.251575 | 3.0 | 0.242695 | 0.017944 | 0.230994 | 0.233737 | 0.248546 | 0.257431 | 0.263355 |
| 42 | 5022.0 | 1.667934e+07 | 4.955853e+07 | 0.0 | 8199170.5 | 1.808698e+07 | 3.641804e+07 | 2.039422e+09 | 5022.0 | 1.343762e+07 | ... | -0.116016 | 0.252303 | 5022.0 | 0.001729 | 0.005176 | -0.355200 | 0.001762 | 0.002076 | 0.002621 | 0.009506 |
| 43 | 4200.0 | 1.476666e+06 | 2.899285e+06 | 1.0 | 730358.5 | 1.634698e+06 | 3.500970e+06 | 1.048729e+08 | 4200.0 | 1.096053e+05 | ... | 0.191428 | 0.288459 | 4200.0 | 0.000286 | 0.023863 | -1.172605 | 0.000959 | 0.001373 | 0.002064 | 0.052612 |
| 44 | 15.0 | 7.059871e+05 | 1.857492e+06 | 0.0 | 0.0 | 1.423310e+05 | 1.945987e+06 | 6.849903e+06 | 15.0 | 6.361672e+05 | ... | -3.914339 | -3.762464 | 15.0 | 0.008729 | 0.004939 | 0.006031 | 0.007103 | 0.008353 | 0.010968 | 0.025629 |
| 45 | 119.0 | 3.143092e+07 | 6.869153e+07 | 0.0 | 8337978.0 | 2.453622e+07 | 7.743989e+07 | 4.555676e+08 | 119.0 | 2.761482e+07 | ... | 0.315489 | 0.469075 | 119.0 | -0.008053 | 0.010896 | -0.037255 | -0.009892 | 0.003204 | 0.006730 | 0.011212 |
| 46 | 2.0 | 0.000000e+00 | 0.000000e+00 | 0.0 | 0.0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.0 | 0.000000e+00 | ... | -33.164470 | -33.164470 | 2.0 | 0.040709 | 0.000004 | 0.040707 | 0.040709 | 0.040711 | 0.040712 | 0.040712 |
| 47 | 2250.0 | 5.671336e+06 | 1.022041e+07 | 17.0 | 2230405.5 | 6.780740e+06 | 1.420137e+07 | 1.767277e+08 | 2250.0 | 7.912650e+05 | ... | -0.109424 | -0.055629 | 2250.0 | 0.001623 | 0.006960 | -0.102333 | 0.001427 | 0.001829 | 0.002416 | 0.311739 |
| 48 | 125.0 | 5.104850e+06 | 1.267453e+07 | 0.0 | 702587.0 | 3.872445e+06 | 1.069429e+07 | 8.766680e+07 | 125.0 | 4.287356e+06 | ... | -0.809141 | -0.056639 | 125.0 | 0.001026 | 0.003646 | -0.014928 | 0.002303 | 0.002680 | 0.003598 | 0.007610 |
| 49 | 1529.0 | 2.438128e+07 | 3.453479e+07 | 0.0 | 13295318.0 | 2.968935e+07 | 5.805895e+07 | 3.817316e+08 | 1529.0 | 1.544221e+07 | ... | 0.210113 | 0.404832 | 1529.0 | 0.004757 | 0.008697 | -0.000623 | 0.004108 | 0.005271 | 0.006632 | 0.321779 |
50 rows × 248 columns
# se guardan los cluster con su información estadistica
pymes_estadistica_clus.to_excel('pymes_estadistica_kmeans_sratio.xlsx', sheet_name='kmeans')
# se guarda la matriz de correlacion de las componentes principales
corr_mat2.iloc[9:,:9].to_excel('correlacion_pca_N_sratio.xlsx', sheet_name='pca')
# se guardan la nueva base de datos con los cluster y los componentes principales
df_pca_cluster.to_excel('pymes_kmeans_sratios.xlsx', sheet_name='kmeans')
# Los 50 cluster ordenados según la cantidad de empresas
df_pca_cluster[["NIT","Cluster"]].groupby('Cluster').count().sort_values("NIT",ascending=False)
| NIT | |
|---|---|
| Cluster | |
| 17 | 7584 |
| 25 | 7582 |
| 42 | 5022 |
| 39 | 4698 |
| 43 | 4200 |
| 11 | 3169 |
| 47 | 2250 |
| 33 | 1953 |
| 49 | 1529 |
| 0 | 528 |
| 35 | 480 |
| 29 | 386 |
| 48 | 125 |
| 21 | 122 |
| 45 | 119 |
| 22 | 58 |
| 37 | 37 |
| 36 | 29 |
| 30 | 23 |
| 44 | 15 |
| 10 | 12 |
| 14 | 11 |
| 13 | 11 |
| 15 | 9 |
| 32 | 7 |
| 27 | 5 |
| 4 | 4 |
| 41 | 3 |
| 38 | 3 |
| 28 | 3 |
| 19 | 3 |
| 23 | 2 |
| 46 | 2 |
| 34 | 2 |
| 7 | 2 |
| 12 | 2 |
| 3 | 1 |
| 40 | 1 |
| 2 | 1 |
| 31 | 1 |
| 5 | 1 |
| 8 | 1 |
| 6 | 1 |
| 26 | 1 |
| 1 | 1 |
| 24 | 1 |
| 20 | 1 |
| 18 | 1 |
| 16 | 1 |
| 9 | 1 |
# Se crea una copia para realizar el análisis de clúster agglomerativo
df_pca_agglom=df_pca_cluster.copy()
Es un tipo de clúster jerárquico, en el cual el agrupamiento se inicia con todas las observaciones separadas, cada una formando un clúster individual. Los clústeres se van combinado a medida que la estructura crece hasta converger en uno solo (Amat, 2020).
5.3.2.1 Pasos del clustering aglomerativo (Amat, 2020)
a) Considerar cada una de las n observaciones como un clúster individual, formando así la base del dendrograma (hojas).
b) Proceso iterativo hasta que todas las observaciones pertenecen a un único clúster:
i. Calcular la distancia entre cada posible par de los n clústeres. El investigador debe determinar el tipo de medida empleada para cuantificar la similitud entre observaciones o grupos (distancia y linkage).
ii. Los dos clústeres más similares se fusionan, de forma que quedan n-1 clústeres.
c) Cortar la estructura de árbol generada (dendrograma) a una determinada altura para crear los clústeres finales.
from sklearn.cluster import AgglomerativeClustering
# Se usan 50 cluster
ag = AgglomerativeClustering(n_clusters=50, linkage='ward', compute_full_tree=True)
ag = ag.fit(scores_pca)
# se crea una nueva variable con los clúster aglomerativos
df_pca_agglom['agglom'] = ag.fit_predict(scores_pca)
# variables usadas
norm4=[ "agglom",'Ingresos_operacionales',
'Costo_ventas', 'Utilidad_bruta', 'Margen_Bruto', 'Otros_ingresos',
'Costos_distribución', 'Gastos_admon', 'Otros_gastos',
'Otras_ganancias_operdidas', 'Utilidad_operativa', 'Margen_operativo',
'Ing_financieros', 'Cto_financieros', 'Utilidad_antes_impuestos',
'Margen_antes_impuestos', 'Gasto_impuestos', 'Utilidad_Neta',
'Margen_Neto', 'Activo', 'Pasivo', 'Patrimonio', 'Ebitda_',
'Margen_Ebitda', 'Indice_endeudamiento', 'ROA', 'ROE','PCA_1',
'PCA_2', 'PCA_3', 'PCA_4', 'PCA_5']
# vista general de las primeras 5 filas y la base de datos con la nueva variable clúster aglomerativo
df_pca_agglom.head()
| Macrosector | Desc_Macrosector | CATEGORIA | Descripción | SUBCATEGORIA | NOMBRE | NIT | PERIODO | Ingresos_operacionales | Costo_ventas | ... | Indice_endeudamiento | ROA | ROE | Cluster | PCA_1 | PCA_2 | PCA_3 | PCA_4 | PCA_5 | agglom | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2016 | 310130 | 4077.0 | ... | 0.730967 | 0.004665 | 0.017341 | 47 | -0.020810 | 0.043945 | 0.012007 | -0.216527 | 0.001229 | 1 |
| 1 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2017 | 300000 | 0.0 | ... | 0.746141 | 0.009568 | 0.037689 | 47 | -0.020790 | 0.046819 | 0.018757 | -0.227293 | 0.001130 | 1 |
| 2 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2018 | 300000 | 0.0 | ... | 0.807461 | 0.003225 | 0.016750 | 47 | -0.020823 | 0.048073 | 0.053170 | -0.261793 | 0.001210 | 1 |
| 3 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2019 | 300000 | 0.0 | ... | 0.792185 | 0.013868 | 0.066731 | 47 | -0.021013 | 0.031303 | 0.043883 | -0.253171 | 0.001417 | 1 |
| 4 | 1 | Sector agropecuario | A | AGRICULTURA, GANADERÍA, CAZA, SILVICULTURA Y P... | 145 | Cría de aves de corral | 800000276 | 2016 | 321430952 | 274957420.0 | ... | 0.667930 | 0.001894 | 0.005704 | 33 | -0.023233 | -0.412560 | 0.087197 | -0.037415 | 0.008790 | 16 |
5 rows × 41 columns
# Estadistica y percentiles
pymes_estadistica_agglom=df_pca_agglom[norm4].groupby('agglom').describe(percentiles=[0.75,0.9])
pymes_estadistica_agglom
| Ingresos_operacionales | Costo_ventas | ... | PCA_4 | PCA_5 | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 50% | 75% | 90% | max | count | mean | ... | 90% | max | count | mean | std | min | 50% | 75% | 90% | max | |
| agglom | |||||||||||||||||||||
| 0 | 4347.0 | 3.989769e+06 | 6.725643e+06 | 1.0 | 1444749.0 | 4.538333e+06 | 1.062998e+07 | 8.218488e+07 | 4347.0 | 4.530057e+05 | ... | 0.077043 | 0.362109 | 4347.0 | 0.001007 | 0.013137 | -0.579766 | 0.001228 | 0.001774 | 0.002658 | 0.068250 |
| 1 | 5932.0 | 1.275453e+07 | 3.495701e+07 | 0.0 | 6795398.5 | 1.462352e+07 | 2.730154e+07 | 1.791562e+09 | 5932.0 | 9.488671e+06 | ... | -0.103963 | -0.044580 | 5932.0 | 0.001725 | 0.000977 | -0.050624 | 0.001715 | 0.001956 | 0.002295 | 0.009110 |
| 2 | 237.0 | 1.072469e+08 | 1.617630e+08 | 0.0 | 47002636.0 | 1.338047e+08 | 2.786430e+08 | 1.086682e+09 | 237.0 | 7.180353e+07 | ... | 0.192693 | 0.338190 | 237.0 | 0.022122 | 0.020660 | -0.014223 | 0.018585 | 0.026206 | 0.034085 | 0.226995 |
| 3 | 303.0 | 2.701005e+07 | 5.742857e+07 | 0.0 | 9040259.0 | 2.529049e+07 | 5.743729e+07 | 4.555676e+08 | 303.0 | 2.216785e+07 | ... | 0.200064 | 0.469075 | 303.0 | -0.005132 | 0.008161 | -0.037255 | -0.004954 | 0.001780 | 0.004299 | 0.011212 |
| 4 | 86.0 | 3.127780e+06 | 9.267270e+06 | 0.0 | 323829.0 | 2.321120e+06 | 8.194252e+06 | 7.620913e+07 | 86.0 | 2.613793e+06 | ... | -1.365714 | -0.943078 | 86.0 | 0.001246 | 0.010194 | -0.063234 | 0.003768 | 0.004362 | 0.005253 | 0.007501 |
| 5 | 2.0 | 6.000000e+00 | 7.071068e+00 | 1.0 | 6.0 | 8.500000e+00 | 1.000000e+01 | 1.100000e+01 | 2.0 | 0.000000e+00 | ... | 0.103227 | 0.156960 | 2.0 | -12.331767 | 1.394867 | -13.318087 | -12.331767 | -11.838607 | -11.542711 | -11.345447 |
| 6 | 9.0 | 8.368928e+06 | 1.198719e+07 | 0.0 | 3516915.0 | 9.977017e+06 | 2.785120e+07 | 3.004307e+07 | 9.0 | 1.624583e+07 | ... | 0.443296 | 0.933976 | 9.0 | -0.023832 | 0.090715 | -0.155046 | 0.031730 | 0.036089 | 0.043083 | 0.055827 |
| 7 | 5.0 | 4.260000e+01 | 8.436113e+01 | 1.0 | 1.0 | 1.700000e+01 | 1.226000e+02 | 1.930000e+02 | 5.0 | 0.000000e+00 | ... | 0.183539 | 0.190781 | 5.0 | -1.968691 | 2.068498 | -5.238289 | -1.172605 | -0.626589 | -0.315962 | -0.108878 |
| 8 | 8.0 | 7.469206e+05 | 1.314557e+06 | 21343.0 | 55804.0 | 7.558175e+05 | 2.224358e+06 | 3.715882e+06 | 8.0 | 2.209620e+07 | ... | 7.157630 | 7.290648 | 8.0 | 0.000917 | 0.024600 | -0.039562 | 0.013556 | 0.015687 | 0.018060 | 0.020002 |
| 9 | 5463.0 | 2.969507e+07 | 4.727375e+07 | 0.0 | 16455467.0 | 3.204692e+07 | 6.895149e+07 | 2.039422e+09 | 5463.0 | 2.335395e+07 | ... | -0.025792 | 0.020017 | 5463.0 | 0.002520 | 0.001296 | -0.004373 | 0.002134 | 0.002931 | 0.004191 | 0.034179 |
| 10 | 13.0 | 1.331863e+05 | 2.657829e+05 | 0.0 | 0.0 | 5.343700e+04 | 5.926808e+05 | 7.459900e+05 | 13.0 | 8.419369e+04 | ... | -6.378097 | -6.116882 | 13.0 | 0.010472 | 0.002189 | 0.006678 | 0.010229 | 0.011304 | 0.011676 | 0.016168 |
| 11 | 1138.0 | 3.245233e+07 | 4.599706e+07 | 0.0 | 16720977.0 | 4.079429e+07 | 7.526634e+07 | 3.622867e+08 | 1138.0 | 2.207673e+07 | ... | 0.183911 | 0.254982 | 1138.0 | 0.004866 | 0.003378 | -0.000623 | 0.004576 | 0.005433 | 0.006303 | 0.088866 |
| 12 | 6.0 | 0.000000e+00 | 0.000000e+00 | 0.0 | 0.0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 6.0 | 0.000000e+00 | ... | -33.164470 | -33.164470 | 6.0 | 0.043879 | 0.002629 | 0.040707 | 0.044419 | 0.045882 | 0.046508 | 0.046666 |
| 13 | 3.0 | 1.565475e+09 | 2.667018e+09 | 16288885.0 | 35072677.0 | 2.340068e+09 | 3.723065e+09 | 4.645063e+09 | 3.0 | 1.342187e+09 | ... | 1.635683 | 2.027757 | 3.0 | -0.400710 | 0.272995 | -0.609188 | -0.501237 | -0.296472 | -0.173612 | -0.091706 |
| 14 | 12.0 | 2.529538e+05 | 4.389018e+05 | 1860.0 | 46189.0 | 3.245730e+05 | 5.153250e+05 | 1.524012e+06 | 12.0 | 4.484263e+06 | ... | 3.729830 | 4.079642 | 12.0 | 0.001583 | 0.011779 | -0.031647 | 0.003824 | 0.008083 | 0.009575 | 0.015312 |
| 15 | 1.0 | 7.948424e+07 | NaN | 79484241.0 | 79484241.0 | 7.948424e+07 | 7.948424e+07 | 7.948424e+07 | 1.0 | 2.354660e+07 | ... | -16.858341 | -16.858341 | 1.0 | -0.078595 | NaN | -0.078595 | -0.078595 | -0.078595 | -0.078595 | -0.078595 |
| 16 | 825.0 | 6.789081e+07 | 9.285153e+07 | 0.0 | 39788065.0 | 8.434260e+07 | 1.508304e+08 | 1.093382e+09 | 825.0 | 4.605395e+07 | ... | 0.196264 | 0.508223 | 825.0 | 0.008784 | 0.006173 | -0.004735 | 0.008341 | 0.010162 | 0.012097 | 0.100375 |
| 17 | 15.0 | 7.059871e+05 | 1.857492e+06 | 0.0 | 0.0 | 1.423310e+05 | 1.945987e+06 | 6.849903e+06 | 15.0 | 6.361672e+05 | ... | -3.914339 | -3.762464 | 15.0 | 0.008729 | 0.004939 | 0.006031 | 0.007103 | 0.008353 | 0.010968 | 0.025629 |
| 18 | 7852.0 | 1.995974e+07 | 2.637316e+07 | 0.0 | 12359209.5 | 2.245883e+07 | 4.125111e+07 | 4.766627e+08 | 7852.0 | 1.566820e+07 | ... | 0.094882 | 0.232194 | 7852.0 | 0.002141 | 0.005748 | -0.066683 | 0.001800 | 0.002375 | 0.003168 | 0.321779 |
| 19 | 2.0 | 1.132148e+07 | 9.545794e+06 | 4571580.0 | 11321475.5 | 1.469642e+07 | 1.672139e+07 | 1.807137e+07 | 2.0 | 7.452510e+06 | ... | -0.081650 | -0.067725 | 2.0 | -0.424560 | 0.031611 | -0.446912 | -0.424560 | -0.413383 | -0.406678 | -0.402207 |
| 20 | 5901.0 | 1.095090e+07 | 1.393647e+07 | 0.0 | 8014706.0 | 1.353063e+07 | 2.206899e+07 | 3.099623e+08 | 5901.0 | 8.036862e+06 | ... | 0.226155 | 0.676265 | 5901.0 | 0.001620 | 0.000690 | -0.002100 | 0.001501 | 0.001898 | 0.002435 | 0.008907 |
| 21 | 11.0 | 5.274738e+08 | 1.509895e+09 | 0.0 | 17680168.0 | 7.501420e+07 | 5.148430e+08 | 5.057506e+09 | 11.0 | 3.656019e+08 | ... | 0.028196 | 0.194452 | 11.0 | 0.024395 | 0.053297 | -0.035565 | 0.023726 | 0.075579 | 0.086426 | 0.089440 |
| 22 | 1240.0 | 1.843691e+07 | 3.630848e+07 | 0.0 | 8325157.5 | 1.829830e+07 | 4.281912e+07 | 6.393797e+08 | 1240.0 | 1.345132e+07 | ... | 0.036920 | 0.208557 | 1240.0 | 0.000609 | 0.001281 | -0.012987 | 0.000842 | 0.001319 | 0.001711 | 0.005648 |
| 23 | 3.0 | 3.603733e+04 | 3.182126e+04 | 500.0 | 45718.0 | 5.380600e+04 | 5.865880e+04 | 6.189400e+04 | 3.0 | 2.991908e+06 | ... | 16.086244 | 16.426443 | 3.0 | 0.041322 | 0.014787 | 0.027543 | 0.039480 | 0.048212 | 0.053451 | 0.056944 |
| 24 | 1717.0 | 4.565299e+06 | 8.413990e+06 | 0.0 | 1155467.0 | 6.325823e+06 | 1.266503e+07 | 1.107906e+08 | 1717.0 | 3.555423e+06 | ... | 0.359692 | 0.588190 | 1717.0 | 0.000731 | 0.002526 | -0.055605 | 0.001141 | 0.001290 | 0.001581 | 0.021174 |
| 25 | 2.0 | 1.000000e+00 | 0.000000e+00 | 1.0 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.0 | 0.000000e+00 | ... | 0.134382 | 0.160540 | 2.0 | 4.101397 | 0.632300 | 3.654294 | 4.101397 | 4.324949 | 4.459080 | 4.548500 |
| 26 | 1.0 | 1.000000e+00 | NaN | 1.0 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.0 | 0.000000e+00 | ... | -0.128744 | -0.128744 | 1.0 | -59.297911 | NaN | -59.297911 | -59.297911 | -59.297911 | -59.297911 | -59.297911 |
| 27 | 1.0 | 7.099000e+03 | NaN | 7099.0 | 7099.0 | 7.099000e+03 | 7.099000e+03 | 7.099000e+03 | 1.0 | 4.912179e+06 | ... | 117.949589 | 117.949589 | 1.0 | 0.209220 | NaN | 0.209220 | 0.209220 | 0.209220 | 0.209220 | 0.209220 |
| 28 | 616.0 | 7.916330e+06 | 1.738029e+07 | 0.0 | 3030218.5 | 8.636899e+06 | 1.841835e+07 | 2.366876e+08 | 616.0 | 5.678439e+06 | ... | -0.268822 | -0.127905 | 616.0 | 0.001088 | 0.001608 | -0.007498 | 0.001570 | 0.001918 | 0.002188 | 0.007324 |
| 29 | 1.0 | 1.000000e+00 | NaN | 1.0 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.0 | 0.000000e+00 | ... | 0.164727 | 0.164727 | 1.0 | -29.876716 | NaN | -29.876716 | -29.876716 | -29.876716 | -29.876716 | -29.876716 |
| 30 | 1.0 | 5.000000e+02 | NaN | 500.0 | 500.0 | 5.000000e+02 | 5.000000e+02 | 5.000000e+02 | 1.0 | 7.682000e+04 | ... | 25.954058 | 25.954058 | 1.0 | 0.041418 | NaN | 0.041418 | 0.041418 | 0.041418 | 0.041418 | 0.041418 |
| 31 | 1.0 | 1.000000e+00 | NaN | 1.0 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.0 | 0.000000e+00 | ... | -0.124460 | -0.124460 | 1.0 | 136.261960 | NaN | 136.261960 | 136.261960 | 136.261960 | 136.261960 | 136.261960 |
| 32 | 1.0 | 3.300000e+02 | NaN | 330.0 | 330.0 | 3.300000e+02 | 3.300000e+02 | 3.300000e+02 | 1.0 | 7.522800e+04 | ... | 38.452993 | 38.452993 | 1.0 | 0.069903 | NaN | 0.069903 | 0.069903 | 0.069903 | 0.069903 | 0.069903 |
| 33 | 1.0 | 1.000000e+00 | NaN | 1.0 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.0 | 0.000000e+00 | ... | 0.196971 | 0.196971 | 1.0 | -46.772675 | NaN | -46.772675 | -46.772675 | -46.772675 | -46.772675 | -46.772675 |
| 34 | 22.0 | 4.170927e+07 | 5.374677e+07 | 0.0 | 10195019.5 | 6.973345e+07 | 1.168678e+08 | 1.757003e+08 | 22.0 | 2.744215e+07 | ... | 0.138066 | 0.301979 | 22.0 | -0.060619 | 0.212393 | -0.998931 | -0.033400 | 0.017040 | 0.022245 | 0.027419 |
| 35 | 1.0 | 5.000000e+02 | NaN | 500.0 | 500.0 | 5.000000e+02 | 5.000000e+02 | 5.000000e+02 | 1.0 | 1.500130e+05 | ... | 51.093162 | 51.093162 | 1.0 | 0.121768 | NaN | 0.121768 | 0.121768 | 0.121768 | 0.121768 | 0.121768 |
| 36 | 6.0 | 5.688748e+05 | 7.133054e+05 | 0.0 | 395827.5 | 7.370995e+05 | 1.310797e+06 | 1.858013e+06 | 6.0 | 2.733088e+05 | ... | -10.737353 | -10.708019 | 6.0 | 0.015635 | 0.001349 | 0.014001 | 0.015583 | 0.016085 | 0.017047 | 0.017842 |
| 37 | 4044.0 | 1.577507e+06 | 2.648187e+06 | 62.0 | 763694.5 | 1.811544e+06 | 3.865031e+06 | 5.104440e+07 | 4044.0 | 3.407735e+05 | ... | 0.224286 | 0.330946 | 4044.0 | 0.001165 | 0.002035 | -0.051087 | 0.000984 | 0.001361 | 0.001980 | 0.052612 |
| 38 | 5.0 | 3.284879e+09 | 4.483398e+09 | 10217838.0 | 90319862.0 | 7.316126e+09 | 8.320491e+09 | 8.990068e+09 | 5.0 | 1.582144e+09 | ... | 0.160146 | 0.254420 | 5.0 | 0.138149 | 0.066231 | 0.020335 | 0.164548 | 0.169440 | 0.174434 | 0.177763 |
| 39 | 1.0 | 4.054500e+04 | NaN | 40545.0 | 40545.0 | 4.054500e+04 | 4.054500e+04 | 4.054500e+04 | 1.0 | 1.003700e+04 | ... | -89.467202 | -89.467202 | 1.0 | 0.106389 | NaN | 0.106389 | 0.106389 | 0.106389 | 0.106389 | 0.106389 |
| 40 | 1.0 | 3.866330e+05 | NaN | 386633.0 | 386633.0 | 3.866330e+05 | 3.866330e+05 | 3.866330e+05 | 1.0 | 2.255800e+05 | ... | -18.451592 | -18.451592 | 1.0 | 0.023295 | NaN | 0.023295 | 0.023295 | 0.023295 | 0.023295 | 0.023295 |
| 41 | 100.0 | 4.893918e+06 | 1.285857e+07 | 0.0 | 646416.0 | 3.725985e+06 | 7.083420e+06 | 8.766680e+07 | 100.0 | 3.584987e+06 | ... | -0.759182 | -0.623129 | 100.0 | 0.001687 | 0.002050 | -0.008530 | 0.002277 | 0.002527 | 0.002814 | 0.007610 |
| 42 | 3.0 | 2.162574e+09 | 3.622069e+09 | 0.0 | 143563931.0 | 3.243861e+09 | 5.104038e+09 | 6.344157e+09 | 3.0 | 1.501287e+09 | ... | 0.245442 | 0.251575 | 3.0 | 0.242695 | 0.017944 | 0.230994 | 0.233737 | 0.248546 | 0.257431 | 0.263355 |
| 43 | 28.0 | 3.706314e+08 | 1.147444e+09 | 0.0 | 15150876.5 | 2.128747e+08 | 6.601626e+08 | 6.035022e+09 | 28.0 | 2.798013e+08 | ... | 0.230109 | 0.309543 | 28.0 | 0.035014 | 0.034013 | -0.021096 | 0.047449 | 0.059913 | 0.067273 | 0.072312 |
| 44 | 2.0 | 3.200700e+04 | 4.455763e+04 | 500.0 | 32007.0 | 4.776050e+04 | 5.721260e+04 | 6.351400e+04 | 2.0 | 1.780366e+06 | ... | 9.768841 | 9.789944 | 2.0 | 0.017072 | 0.009878 | 0.010087 | 0.017072 | 0.020565 | 0.022660 | 0.024057 |
| 45 | 1.0 | 1.000000e+00 | NaN | 1.0 | 1.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.0 | 1.000000e+00 | ... | -0.126627 | -0.126627 | 1.0 | -53.892215 | NaN | -53.892215 | -53.892215 | -53.892215 | -53.892215 | -53.892215 |
| 46 | 29.0 | 5.530323e+05 | 9.286132e+05 | 1722.0 | 64561.0 | 5.483660e+05 | 1.480986e+06 | 3.989789e+06 | 29.0 | 2.925637e+06 | ... | 1.307938 | 1.688674 | 29.0 | -0.001208 | 0.006749 | -0.020060 | 0.000678 | 0.002987 | 0.004433 | 0.013239 |
| 47 | 1.0 | 1.566311e+07 | NaN | 15663111.0 | 15663111.0 | 1.566311e+07 | 1.566311e+07 | 1.566311e+07 | 1.0 | 1.342816e+07 | ... | -1.362815 | -1.362815 | 1.0 | 0.352298 | NaN | 0.352298 | 0.352298 | 0.352298 | 0.352298 | 0.352298 |
| 48 | 1.0 | 0.000000e+00 | NaN | 0.0 | 0.0 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.0 | 0.000000e+00 | ... | -17.232112 | -17.232112 | 1.0 | 0.056317 | NaN | 0.056317 | 0.056317 | 0.056317 | 0.056317 | 0.056317 |
| 49 | 3.0 | 1.630902e+06 | 2.733514e+06 | 51325.0 | 54089.0 | 2.420690e+06 | 3.840651e+06 | 4.787291e+06 | 3.0 | 3.381558e+06 | ... | -0.280979 | -0.211948 | 3.0 | 0.084020 | 0.007777 | 0.078814 | 0.080287 | 0.086623 | 0.090425 | 0.092960 |
50 rows × 248 columns
# Se guarda la estadistica y los clúster aglomerativos
pymes_estadistica_agglom.to_excel('agglom_estadistica_sratios.xlsx', sheet_name='agglom')
# se guarda la base de datos nueva con los cluster aglomerativos
df_pca_agglom.to_excel('agglom_clust_sratios.xlsx', sheet_name='Agglom')
# Clúster aglomerativo ordenados por el numero de empresas para cada clúster
df_pca_agglom[["NIT","agglom"]].groupby('agglom').count().sort_values("NIT",ascending=False)
| NIT | |
|---|---|
| agglom | |
| 18 | 7852 |
| 1 | 5932 |
| 20 | 5901 |
| 9 | 5463 |
| 0 | 4347 |
| 37 | 4044 |
| 24 | 1717 |
| 22 | 1240 |
| 11 | 1138 |
| 16 | 825 |
| 28 | 616 |
| 3 | 303 |
| 2 | 237 |
| 41 | 100 |
| 4 | 86 |
| 46 | 29 |
| 43 | 28 |
| 34 | 22 |
| 17 | 15 |
| 10 | 13 |
| 14 | 12 |
| 21 | 11 |
| 6 | 9 |
| 8 | 8 |
| 12 | 6 |
| 36 | 6 |
| 38 | 5 |
| 7 | 5 |
| 42 | 3 |
| 49 | 3 |
| 13 | 3 |
| 23 | 3 |
| 25 | 2 |
| 19 | 2 |
| 5 | 2 |
| 44 | 2 |
| 48 | 1 |
| 47 | 1 |
| 45 | 1 |
| 15 | 1 |
| 30 | 1 |
| 31 | 1 |
| 39 | 1 |
| 26 | 1 |
| 27 | 1 |
| 35 | 1 |
| 29 | 1 |
| 33 | 1 |
| 32 | 1 |
| 40 | 1 |
# Cantidad de empresas por periodos para cada clúster aglomerativo
Paglo_agg=df_pca_agglom[["NIT","agglom","PERIODO"]].groupby(["NIT","agglom"],as_index=False).count()
Paglo_agg.rename(columns={'PERIODO':'CANTIDAD_PERIODOS'},
inplace=True)
Paglo_agg.tail(30)
| NIT | agglom | CANTIDAD_PERIODOS | |
|---|---|---|---|
| 17362 | 901014567 | 38 | 4 |
| 17363 | 901033073 | 3 | 1 |
| 17364 | 901033073 | 42 | 1 |
| 17365 | 901033073 | 49 | 2 |
| 17366 | 901033282 | 3 | 2 |
| 17367 | 901033282 | 23 | 2 |
| 17368 | 901033316 | 3 | 3 |
| 17369 | 901033316 | 23 | 1 |
| 17370 | 901034604 | 3 | 2 |
| 17371 | 901034604 | 38 | 2 |
| 17372 | 901035348 | 3 | 1 |
| 17373 | 901035348 | 42 | 1 |
| 17374 | 901035348 | 49 | 2 |
| 17375 | 901035576 | 10 | 1 |
| 17376 | 901035576 | 13 | 1 |
| 17377 | 901035576 | 31 | 1 |
| 17378 | 901035576 | 39 | 1 |
| 17379 | 901035582 | 3 | 3 |
| 17380 | 901035582 | 8 | 1 |
| 17381 | 901038787 | 31 | 4 |
| 17382 | 901038968 | 3 | 4 |
| 17383 | 901042983 | 13 | 1 |
| 17384 | 901042983 | 31 | 3 |
| 17385 | 901061715 | 3 | 1 |
| 17386 | 901061715 | 38 | 3 |
| 17387 | 901085146 | 38 | 1 |
| 17388 | 901085146 | 44 | 2 |
| 17389 | 901085146 | 48 | 1 |
| 17390 | 901107672 | 23 | 2 |
| 17391 | 901107672 | 38 | 2 |
# Se guarda el archivo con las empresas por periodo por clúster
Paglo_agg.to_excel('agglom_50_nit.xlsx', sheet_name='cantidad')
df_cluster_k_agg =pd.read_excel('./agglom_clust_sratios.xlsx')
df_cluster_k_agg.head()
| Unnamed: 0 | Macrosector | Desc_Macrosector | CATEGORIA | Descripción | SUBCATEGORIA | NOMBRE | NIT | PERIODO | Ingresos_operacionales | ... | Indice_endeudamiento | ROA | ROE | Cluster | PCA_1 | PCA_2 | PCA_3 | PCA_4 | PCA_5 | agglom | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2016 | 310130 | ... | 0.730967 | 0.004665 | 0.017341 | 47 | -0.020810 | 0.043945 | 0.012007 | -0.216527 | 0.001229 | 1 |
| 1 | 1 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2017 | 300000 | ... | 0.746141 | 0.009568 | 0.037689 | 47 | -0.020790 | 0.046819 | 0.018757 | -0.227293 | 0.001130 | 1 |
| 2 | 2 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2018 | 300000 | ... | 0.807461 | 0.003225 | 0.016750 | 47 | -0.020823 | 0.048073 | 0.053170 | -0.261793 | 0.001210 | 1 |
| 3 | 3 | 2 | Sector de servicios | L | ACTIVIDADES INMOBILIARIAS | 6810 | Actividades inmobiliarias realizadas con bien... | 800000268 | 2019 | 300000 | ... | 0.792185 | 0.013868 | 0.066731 | 47 | -0.021013 | 0.031303 | 0.043883 | -0.253171 | 0.001417 | 1 |
| 4 | 4 | 1 | Sector agropecuario | A | AGRICULTURA, GANADERÍA, CAZA, SILVICULTURA Y P... | 145 | Cría de aves de corral | 800000276 | 2016 | 321430952 | ... | 0.667930 | 0.001894 | 0.005704 | 33 | -0.023233 | -0.412560 | 0.087197 | -0.037415 | 0.008790 | 16 |
5 rows × 42 columns
import plotly.express as px
# grafica con los 12 cluster KMEANS con mayor cantidad de empresas
dat = df_cluster_k_agg[df_cluster_k_agg.Cluster.isin([17,25,42,39,43,11,47,33,49,0,35,29])]
fig = px.scatter_3d(dat, x='PCA_1', y='PCA_2', z='PCA_3',
color='Cluster')
fig.show()
fig.write_html("./12kmeans.html")
# grafica con los cluster KMEANS 0, 35, 49
dat = df_cluster_k_agg[df_cluster_k_agg.Cluster.isin([49,0,35])]
fig = px.scatter_3d(dat, x='PCA_1', y='PCA_2', z='PCA_3',
color='Cluster')
fig.show()
fig.write_html("./0_35_49kmeans.html")
# grafica con todos los cluster KMEANS
dat = df_cluster_k_agg
fig = px.scatter_3d(dat, x='Utilidad_bruta', y='Cto_financieros', z='Ingresos_operacionales',
color='Cluster')
fig.show()
fig.write_html("./todoskmeans.html")
# grafica con todos los cluster KMEANS
dat = df_cluster_k_agg
fig = px.scatter_3d(dat, x='PCA_1', y='PCA_2', z='PCA_3',
color='Cluster')
fig.show()
fig.write_html("./todoskmeanspca.html")
# grafica con los 12 cluster Aglomerativo con mayor cantidad de empresas
dat = df_cluster_k_agg[df_cluster_k_agg.agglom.isin([18,1,20,9,0,37,24,22,11,16,28,3])]
fig = px.scatter_3d(dat, x='PCA_1', y='PCA_2', z='PCA_3',
color='agglom')
fig.show()
fig.write_html("./12agglom.html")
# grafica con los cluster Aglomerativo 37,16,3
dat = df_cluster_k_agg[df_cluster_k_agg.agglom.isin([37,16,3])]
fig = px.scatter_3d(dat, x='PCA_1', y='PCA_2', z='PCA_3',
color='agglom')
fig.show()
fig.write_html("./37_16_3_agglom.html")
# grafica con los todos los cluster Aglomerativo
dat = df_cluster_k_agg
fig = px.scatter_3d(dat, x='Utilidad_bruta', y='Cto_financieros', z='Ingresos_operacionales',
color='agglom')
fig.show()
fig.write_html("./todosagglom.html")
# grafica con los todos los cluster Aglomerativo
dat = df_cluster_k_agg
fig = px.scatter_3d(dat, x='PCA_1', y='PCA_2', z='PCA_3',
color='agglom')
fig.show()
fig.write_html("./todosagglompca.html")